Some Open Questions Related to Cuckoo Hashing - PowerPoint PPT Presentation


Title: Some Open Questions Related to Cuckoo Hashing


1
Some Open Questions Related to Cuckoo Hashing
  • Michael Mitzenmacher
  • Harvard University

2
The Beginnings
3
Cuckoo Hashing
  • Basic scheme: each element gets two possible
    locations (uniformly at random).
  • To insert x, check both locations for x. If one
    is empty, insert.
  • If both are full, x kicks out an old element y.
    Then y moves to its other location.
  • If that location is full, y kicks out z, and so
    on, until an empty slot is found.
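
The insertion procedure above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the class name `CuckooTable`, the `max_kicks` cutoff, and the use of salted calls to Python's built-in `hash` in place of two truly random hash functions are all assumptions of the sketch.

```python
import random

class CuckooTable:
    """Two-choice cuckoo hash table, one element per slot (illustrative sketch)."""

    def __init__(self, size, max_kicks=100, seed=0):
        self.slots = [None] * size
        self.max_kicks = max_kicks  # assumed cutoff before declaring a failure
        rng = random.Random(seed)
        # Two salts stand in for two independent random hash functions.
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _locations(self, x):
        n = len(self.slots)
        return [hash((salt, x)) % n for salt in self.salts]

    def lookup(self, x):
        # Worst-case constant time: only two slots are ever inspected.
        return any(self.slots[i] == x for i in self._locations(x))

    def insert(self, x):
        """Return True on success, False if the kick limit is reached."""
        if self.lookup(x):
            return True
        a, b = self._locations(x)
        for i in (a, b):
            if self.slots[i] is None:
                self.slots[i] = x
                return True
        pos = a  # both locations are full: start kicking
        for _ in range(self.max_kicks):
            # x takes the slot; the old occupant becomes the element in hand.
            x, self.slots[pos] = self.slots[pos], x
            a, b = self._locations(x)
            pos = b if pos == a else a  # the evicted element's other location
            if self.slots[pos] is None:
                self.slots[pos] = x
                return True
        # Failure: the element in hand stays unplaced (a real table would re-hash).
        return False
```

A caller would typically treat a `False` return as the signal to rebuild the table with fresh hash functions.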

4
Cuckoo Hashing Examples
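[Figure: hash table diagram; elements shown: A, B, C, E, D.]
5
Cuckoo Hashing Examples
[Figure: hash table diagram; elements shown: A, B, C, F, E, D.]
6
Cuckoo Hashing Examples
[Figure: hash table diagram; elements shown: A, B, F, C, E, D.]
7
Cuckoo Hashing Examples
[Figure: hash table diagram; elements shown: A, B, F, C, G, E, D.]
8
Cuckoo Hashing Examples
[Figure: hash table diagram; elements shown: E, G, B, F, C, A, D.]
9
Cuckoo Hashing Examples
[Figure: hash table diagram; elements shown: A, B, C, G, E, D, F.]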
10
Why Do We Care About Cuckoo Hashing?
  • Hash tables: a fundamental data structure.
  • Multiple-choice hashing yields tables with:
      • High memory utilization.
      • Constant time look-ups.
      • Simplicity: easily coded, parallelized.
  • Cuckoo hashing expands on this, combining
    multiple choices with ability to move elements.
  • Practical potential, and theoretically
    interesting!

11
Good Properties of Cuckoo Hashing
  • Worst case constant lookup time.
  • High memory utilizations possible.
  • Simple to build, design.

12
Cuckoo Hashing Failures
  • Bad case 1: inserted element runs into cycles.
  • Bad case 2: inserted element has very long path
    before insertion completes.
  • Could be on a long cycle.
  • Bad cases occur with very small probability when
    load is sufficiently low.
  • Theoretical solution: re-hash everything if a
    failure occurs.

13
Various Representations
[Figure: diagrams of alternative representations, each relating buckets and elements.]
14
Basic Performance
  • For 2 choices, load less than 50%, n elements
    gives failure rate of Θ(1/n); maximum insert time
    O(log n).
  • Related to random graph representation.
  • Each element is an edge, buckets are vertices.
  • Edge corresponds to two random choices of an
    element.
  • Small load implies small acyclic or unicyclic
    components, of size at most O(log n).

15
Natural Extensions
  • More than 2 choices per element.
  • Very different: hypergraphs instead of graphs.
  • D. Fotakis, R. Pagh, P. Sanders, and P. Spirakis.
  • Space efficient hash tables with worst case
    constant access time.
  • More than 1 element per bucket.
  • M. Dietzfelbinger and C. Weidling.
  • Balanced allocation and dictionaries with tightly
    packed constant size bins.

16
Variations
  • Online: elements inserted and deleted as you go.
  • Constant expected time; logarithmic (or
    polylogarithmic) time with high probability per
    element.
  • Offline: all elements available at start.
  • Becomes a maximum matching problem.
  • No real moving of elements -- equivalent to
    offline version of multiple-choice hashing of
    Azar, Broder, Karlin, and Upfal.
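
To illustrate the offline view, the sketch below places elements by computing a maximum bipartite matching between elements and buckets with simple augmenting paths. It is a textbook matching routine used here as an assumption for illustration; the function name `offline_placement` and its input format are hypothetical, and nothing in the slides says this particular routine is the one analyzed.

```python
def offline_placement(choices, num_buckets):
    """choices[e] lists the candidate buckets of element e.

    Returns bucket_of, where bucket_of[e] is the bucket assigned to e,
    or None if e could not be placed in a maximum matching.
    """
    owner = [None] * num_buckets  # owner[b] = element currently assigned to b

    def try_place(e, visited):
        # Depth-first search for an augmenting path starting at element e.
        for b in choices[e]:
            if b in visited:
                continue
            visited.add(b)
            if owner[b] is None or try_place(owner[b], visited):
                owner[b] = e
                return True
        return False

    for e in range(len(choices)):
        try_place(e, set())

    bucket_of = [None] * len(choices)
    for b, e in enumerate(owner):
        if e is not None:
            bucket_of[e] = b
    return bucket_of
```

Because all elements are known up front, the search only reassigns entries in `owner`; no element is physically moved until the final assignment is read off.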

17
Open Question 1: Random Walk Cuckoo Hashing
  • More than 2 choices is important.
  • Much higher memory utilizations.
  • 3 choices: 90% in experiments.
  • 4 choices: about 97%.
  • Analysis [FPSS]: use breadth-first search on
    bipartite graph to find an augmenting path.
  • Not practical for many implementations.

18
Random Walk Cuckoo Hashing
  • When it is time to kick something out, choose one
    randomly.
  • Small state, effective.
  • Intuition: if a fraction p of the buckets are
    empty, the random walk should have probability p
    of finding an empty bucket at each step.
  • Clearly wrong, but nice intuition.
  • Suggests logarithmic time to find an empty slot.
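
A minimal sketch of the random-walk rule with d choices follows. The helper names `make_locations` and `random_walk_insert`, the `max_steps` cutoff, and the salted built-in `hash` standing in for d random hash functions are assumptions of the sketch; practical variants often also avoid immediately re-evicting from the slot just vacated, which is omitted here.

```python
import random

def make_locations(d, n, seed=0):
    """Return a function mapping x to its d candidate slots in a table of size n."""
    salts = [random.Random(seed + j).getrandbits(64) for j in range(d)]
    return lambda x: [hash((s, x)) % n for s in salts]

def random_walk_insert(slots, locations, x, rng, max_steps=1000):
    """Try to place x; return (evictions, leftover).

    leftover is None on success, otherwise the element left unplaced
    after max_steps evictions.
    """
    for evictions in range(max_steps):
        cands = locations(x)
        for i in cands:
            if slots[i] is None:
                slots[i] = x
                return evictions, None
        # All d candidates are full: kick out a uniformly random occupant.
        victim = rng.choice(cands)
        x, slots[victim] = slots[victim], x
    return max_steps, x
```

Only the element in hand is kept as state, which is the "small state" appeal noted above; recording the returned eviction counts over many inserts is one way to probe the logarithmic-time intuition empirically.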

19
The Open Question
  • Find tight bounds on the performance of random
    walk cuckoo hashing in the online setting, for d
    ≥ 3 choices (and possibly more than one element
    per bucket).

20
Recent Progress
  • Polylogarithmic bounds on insertion time for
    large number of choices [RANDOM 09, Frieze,
    Mitzenmacher, Melsted].
  • Two-step argument:
  • Most buckets have an augmenting path of length
    O(log log n) to an empty bucket. (Reach empty
    bucket with inverse logarithmic probability.)
  • Expansion gives such a bucket is found after
    O(log n) steps with high probability.

21
The Open Question
  • Find tight bounds on the performance of random
    walk cuckoo hashing in the online setting, for d
    ≥ 3 choices (and possibly more than one element
    per bucket).
  • Is logarithmic insertion time the right answer?
    Lower bounds?
  • Better understanding of graph structure with 3 or
    more choices.

22
Open Question 2: Thresholds
  • How much load can cuckoo hashing handle before
    collisions overwhelm it?
  • There appear to be asymptotic thresholds.
  • Fine below the threshold, disaster after.
  • Useful for designs for real systems.
  • The case for 2 choices, 1 element per bucket well
    understood.
  • Less so for other cases.

23
The Open Question
  • Tight thresholds for cuckoo hashing schemes, and
    corresponding efficient algorithms.

24
What's Known
  • 2 choices, 1 element per bucket well understood.
  • For 2 choices, more than 1 element per bucket:
  • Corresponds to orientability problems on random
    graphs: orient edges so no more than k point
    to each vertex.
  • Offline thresholds known.
  • Online (provable) thresholds weak.
  • For more than 2 choices:
  • Harder orientability problems.
  • Online (provable) thresholds weak.
  • Very close lower/upper bounds for offline setting.

25
New Result
  • Dietzfelbinger, Goerdt, Mitzenmacher, Montanari,
    Pagh have tight bounds on offline thresholds,
    more than 2 choices, 1 item per bucket.
  • Extension to more than 1 item per bucket still
    open.
  • Writeup (hopefully) coming soon

26
What Was Known (Example)
  • Case of 3 choices.
  • Upper bound on load of 0.9183 [Batu, Berenbrink,
    Cooper].
  • Uses differential-equation based analysis of
    orientability threshold.
  • Lower bound of 0.8894 (offline) [Dietzfelbinger,
    Pagh].
  • Random maximum matching problem.
  • Use random matrices with 3 ones per column to
    design dictionary schemes. Bound corresponds to
    full-rank threshold of such matrices.
  • Upper bound is tight, using better bound on
    full-rank threshold.

27
The Open Question
  • Tight thresholds for cuckoo hashing schemes, and
    corresponding efficient algorithms.
  • Offline bounds for more than 2 choices.
  • Offline bounds for more than 2 choices and more
    than 1 item per bucket.
  • Online bounds generally.
  • Specific case of d = 3 especially interesting.

28
Open Question 3: Stashes
  • A failure is declared whenever one element can't
    be placed.
  • Is that really necessary?
  • What if we could keep one element unplaced? Or
    eight? Or O(log n)?
  • Goal: reduce the failure probability.
  • Second goal: reduce moves per insert.

29
The Open Question
  • What is the value of some extra space to stash
    problematic elements?

30
Motivation: CAMs
  • CAM: content-addressable memory.
  • Fully associative lookup.
  • Usually expensive, so must be kept small.
  • Hardware solution, or a dedicated cache line in
    software.
  • Not usually considered in theoretical work, but
    very useful in practice.
  • Can we bridge this gap?
  • What can CAMs do for us?

31
A CAM-Stash
  • Use a CAM to stash away elements that would cause
    failure.
  • ESA 2008, Kirsch, Mitzenmacher, Wieder.
  • Intuition: if failures were independent,
    probability that s elements cause failures goes
    to Θ(1/n^s).
  • Failures not independent, but nearly so.
  • A stash holding a constant number of elements
    greatly reduces failure probability.
  • Implemented as hardware CAM or cache line.
  • Lookup requires also looking at stash.
  • But generally empty.

32
Analysis Method
  • Treat cells as vertices, elements as edges in
    bipartite graph.
  • Count components that have excess edges to be
    placed in stash.
  • Random graph analysis to bound excess edges.

6 vertices, 7 edges: 1 edge must go into stash.
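
The counting step can be sketched for the 2-choice, one-element-per-bucket case: a connected component with v vertices can hold at most v edges, so any edges beyond that must go to the stash. The function name `stash_needed` and the union-find bookkeeping are assumptions of this sketch, not taken from the slides.

```python
from collections import Counter

def stash_needed(num_buckets, edges):
    """edges holds one (u, v) pair per element: its two candidate buckets.

    Returns the number of elements that cannot be placed when each bucket
    holds one element, i.e. the stash size this instance requires.
    """
    parent = list(range(num_buckets))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)

    vertices = Counter(find(v) for v in range(num_buckets))
    edge_count = Counter(find(u) for u, _ in edges)
    # Excess edges in a component = edges beyond its number of vertices.
    return sum(max(0, edge_count[r] - vertices[r]) for r in edge_count)
```

For example, a 6-cycle with one extra chord has 6 vertices and 7 edges, and the function reports a stash requirement of 1. Repeating the computation over many random edge sets would give a histogram of stash sizes of the kind reported in the experiment.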
33
A Simple Experiment
  • 10,000 elements, table of size 24,000, 2 choices
    per element, 10^7 trials.

Stash Size Needed    Trials
0                    9989861
1                    10040
2                    97
3                    2
4                    0
34
Generalizations
  • Can similarly generalize known results for cuckoo
    hashing with more than 2 choices, more than 1
    element per bucket.
  • Stash of size s reduces failure exponent linearly
    in s.
  • Intuition: random graph analysis exposes
    bottleneck in cuckoo hashing. Stashes relieve
    the bottleneck.

35
CAM to Improve Insertion Time
  • Lots of moves per insert in worst case.
  • Average is constant.
  • But maximum is Ω(log n) with non-trivial
    (inverse-poly) probability.
  • May want bounded number of memory accesses per
    insert.
  • Empirical study by Kirsch/Mitzenmacher.

36
A CAM-Queue
  • Insertion is a sequence of suboperations.
  • Of the form: move x to position H_j(x).
  • Use the CAM as a queue for pending suboperations.
  • Perform suboperations from queue as available.
  • Move attempt = 1 lookup/write.
  • A suboperation may cause another suboperation to
    go on the queue.
  • Lookup: check the hash table and the CAM-queue.
  • De-amortization:
  • Use queue to turn worst-case performance into
    average-case performance.
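
The queueing idea can be sketched as below for 2 choices. This is an illustration under stated assumptions rather than the scheme studied in the cited work: the class name `QueuedCuckoo`, the helper `two_choice`, the "always evict the occupant" move rule, the first-in-first-out ordering, and an unbounded Python `deque` standing in for the CAM are all choices made for the sketch.

```python
from collections import deque

def two_choice(n):
    """Assumed stand-in for two random hash functions: h(x, j), j in {0, 1}."""
    return lambda x, j: hash((j, x)) % n

class QueuedCuckoo:
    def __init__(self, size, h, ops_per_insert=4):
        self.slots = [None] * size
        self.h = h
        self.queue = deque()  # pending suboperations (x, j): "move x to h(x, j)"
        self.ops_per_insert = ops_per_insert

    def _step(self):
        x, j = self.queue.popleft()
        pos = self.h(x, j)
        y, self.slots[pos] = self.slots[pos], x  # one lookup/write
        if y is not None:
            # y was evicted: queue a move to its other location.
            other = 1 if self.h(y, 0) == pos else 0
            self.queue.append((y, other))

    def insert(self, x):
        if not self.lookup(x):
            self.queue.append((x, 0))
        # Bounded work per insert, regardless of how long the kick chain is.
        for _ in range(self.ops_per_insert):
            if not self.queue:
                break
            self._step()

    def lookup(self, x):
        in_table = any(self.slots[self.h(x, j)] == x for j in (0, 1))
        return in_table or any(q == x for q, _ in self.queue)
```

Each call to `insert` performs at most `ops_per_insert` memory operations, so a long eviction chain is spread over later inserts instead of stalling one of them; the price is that lookups must also scan the queue.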

37
Queue Sizes
  • Need CAM sized to overflow with negligible
    probability.
  • Maximum queue size much bigger than average.
  • Experiments suggest queues of size in the low tens
    are possible, with 4 suboperations per insert, in
    practice.
  • Recent work by Arbitman, Naor, Segev gives
    provable bounds for logarithmic-sized queue for
    2-choice cuckoo hashing, up to 50% load.
  • Analysis open for more than 2 choices.

38
The Open Question
  • What is the value of some extra space to stash
    problematic elements?
  • Can these uses of stashes be similarly useful for
    other data structures?
  • Is there a general theory telling us the value of
    constant/logarithmic/linear sized stashes?

39
Open Question 4: Randomness
  • Analysis always easier when assuming hash
    functions are perfectly random.
  • But perfect hash functions are unrealistic.
  • What about real hash functions on real data?

40
The Open Question
  • How much randomness is needed for cuckoo hashing
    to be effective?

41
Universal Hash Families
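  • Defined by Carter/Wegman.
  • Family of hash functions Λ of form H: [N] → [M]
    is k-wise independent if, when H is chosen
    randomly, for any distinct x_1, x_2, ..., x_k and any
    a_1, a_2, ..., a_k,
    $\Pr[H(x_1) = a_1 \wedge H(x_2) = a_2 \wedge \cdots \wedge H(x_k) = a_k] = 1/M^k$.
  • Family is k-wise universal if
    $\Pr[H(x_1) = H(x_2) = \cdots = H(x_k)] \le 1/M^{k-1}$.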
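
As a concrete instance, the classic Carter/Wegman construction h(x) = ((a·x + b) mod p) mod m is 2-universal for keys below a prime p. The sketch below uses hypothetical helper names (`make_hash`, `sample_hash`, `collision_fraction`); the last helper enumerates the whole family, so it is only practical for tiny p and is included purely to make the collision bound checkable.

```python
import random

def make_hash(a, b, p, m):
    """h(x) = ((a*x + b) mod p) mod m, for keys 0 <= x < p with p prime."""
    return lambda x: ((a * x + b) % p) % m

def sample_hash(p, m, rng):
    """Draw one function from the family: a in [1, p), b in [0, p)."""
    return make_hash(rng.randrange(1, p), rng.randrange(p), p, m)

def collision_fraction(x, y, p, m):
    """Exact fraction of the family's functions that map x and y together."""
    hits = sum(
        make_hash(a, b, p, m)(x) == make_hash(a, b, p, m)(y)
        for a in range(1, p)
        for b in range(p)
    )
    return hits / ((p - 1) * p)
```

For any two distinct keys, `collision_fraction` stays at or below 1/m, which is the 2-universal condition. The family is only approximately pairwise independent, because reducing mod m makes some output values slightly more likely than others.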

42
Recent Results
  • For 2 choices, O(log n)-wise independence is
    sufficient; [PR] show hash functions of Siegel
    suffice.
  • Queueing result of [ANS] uses new technology of
    Braverman to show polylogarithmic-wise
    independence suffices.
  • Cohen and Kane show 5-independence is not enough;
    they also show that 1 O(log n)-wise independent and 1
    pairwise independent hash function suffice.

43
Another Approach: Random Data
  • Previous analysis for worst-case data. What
    about random data?
  • Analysis usually trivial if data is
    independently, uniformly chosen over large
    universe.
  • Then all hashes appear perfectly random.
  • Not a good model for real data.
  • Need intermediate model between worst-case,
    average case. [Mitzenmacher, Vadhan]

44
A Model for Data
  • Based on models of semi-random sources.
  • Data is a finite stream, modeled by a sequence of
    random variables X_1, X_2, ..., X_T.
  • Range of each variable is [N].
  • Each stream element has some entropy, conditioned
    on values of previous elements.
  • Correlations possible.
  • But each new element has some unpredictability.

45
Applications
  • Potentially, wherever hashing is used
  • Bloom Filters
  • Power of Two Choices
  • Linear Probing
  • Cuckoo Hashing
  • Many Others

46
Intuition
  • If each element has entropy, then extract the
    entropy to hash each element to near-uniform
    location.
  • Extractors should provide near-uniform behavior.

47
Notions of Entropy
  • max probability
  • min-entropy
  • block source with max probability p per block
  • collision probability
  • Rényi entropy
  • block source with collision probability p per block
  • These entropies are within a factor of 2.
  • We use collision probability/Rényi entropy.
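
The formulas for these quantities did not survive in the transcript. The sketch below uses the standard definitions, which are an assumption about what the slide showed: max probability is the largest point mass, min-entropy is log(1/max probability), collision probability cp(X) is the sum of squared point masses, and Rényi entropy is log(1/cp(X)). Function names are hypothetical.

```python
import math

def max_probability(dist):
    """Largest probability of any single outcome; dist maps outcome -> probability."""
    return max(dist.values())

def min_entropy(dist):
    return -math.log2(max_probability(dist))

def collision_probability(dist):
    """Probability that two independent samples from dist are equal."""
    return sum(p * p for p in dist.values())

def renyi_entropy(dist):
    return -math.log2(collision_probability(dist))
```

Since (max probability)² ≤ cp(X) ≤ max probability, the Rényi entropy always lies between the min-entropy and twice the min-entropy, which is the factor-of-2 relationship noted above.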

48
Leftover Hash Lemma
  • A classical result (from 1989) [ILL].
  • Intuitive statement: if H: [N] → [M]
    is chosen from a pairwise independent hash
    family, and X is a random variable with small
    collision probability, H(X) will be close to
    uniform.

49
Leftover Hash Lemma
  • Specific statements for current setting.
  • For 2-universal hash families.
  • Let H: [N] → [M] be a random hash
    function from a 2-universal hash family Λ. If
    cp(X) < 1/K, then (H, H(X)) is
    $\frac{1}{2}\sqrt{M/K}$-close to (H, U_[M]).
  • Equivalently, if X has Rényi entropy at least log
    M + 2 log(1/ε), then (H, H(X)) is ε-close to
    uniform.
  • Let H: [N] → [M] be a random hash
    function from a 2-universal hash family. Given a
    block-source with collision probability 1/K per block,
    (H, H(X_1), ..., H(X_T)) is
    $\frac{T}{2}\sqrt{M/K}$-close to (H, U_[M]^T).
  • Equivalently, if X has Rényi entropy at least log
    M + 2 log(T/ε), then (H, H(X_1), ..., H(X_T)) is ε-close
    to uniform.
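
A short worked check of the two "equivalently" statements, assuming the usual form of the Leftover Hash Lemma bound, namely distance at most (1/2)√(M/K) for a single sample and (T/2)√(M/K) for a block source. These constants are the standard ones and are an assumption about what the slide displayed.

```latex
% Single sample: choose K so that the bound is at most \varepsilon.
K = \frac{M}{\varepsilon^{2}}
\;\Longrightarrow\;
\tfrac{1}{2}\sqrt{M/K} = \tfrac{\varepsilon}{2} \le \varepsilon,
\qquad
\log K = \log M + 2\log(1/\varepsilon).

% Block source with T blocks: the error grows by a factor of T.
K = \frac{M T^{2}}{\varepsilon^{2}}
\;\Longrightarrow\;
\tfrac{T}{2}\sqrt{M/K} = \tfrac{\varepsilon}{2} \le \varepsilon,
\qquad
\log K = \log M + 2\log(T/\varepsilon).

% Example: M = 2^{20},\ \varepsilon = 2^{-10} requires Renyi entropy
% of at least 20 + 2 \cdot 10 = 40 bits per element in the single-sample case.
```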

50
Further Improvements
  • Additional improvements over Leftover Hash Lemma
    in paper [MV08].
  • Chung and Vadhan [CV08] further improve analysis.
  • Dietzfelbinger and Schellbach show you have to be
    careful: pairwise independence not enough even
    for random data sets from small universe [DS09].
  • Not enough entropy when data large compared to
    universe.

51
The Open Question
  • How much randomness is needed for cuckoo hashing
    to be effective?
  • Tighten bound on independence needed in worst
    case, and provide efficient hash function
    families.
  • What better results are possible with reasonable
    assumptions on the data?

52
Open Question 5: Parallel Architectures
  • Multicores, Graphics Processor Units (GPUs),
    other parallel architectures possibly the next
    wave.
  • Multiple-choice hashing and cuckoo hashing seem
    naturally parallelizable.
  • Theory and practice?

53
The Open Question
  • Design and analyze efficient schemes for
    constructing and maintaining hash tables in
    modern parallel architectures.

54
Related Work
  • Plenty on parallel hashing/load balancing
    schemes.
  • PRAM emulation, related work in the 1990s.
  • Technical improvements of last decade suggest
    more is possible.
  • In Amenta et al., we designed new implementation
    for GPUs based on cuckoo hashing.
  • To appear in SIGGRAPH 09.
  • New theory, practical implementations possible?

55
The Open Question
  • Design and analyze efficient schemes for
    constructing and maintaining hash tables in
    modern parallel architectures.
  • How can cuckoo hashing be helpful?
  • Practical implementations, with strong
    theoretical backing?

56
Open Questions
  • Tight bounds on insertion times for random walk
    cuckoo hashing for d > 2 choices.
  • Tight bounds on load capacity thresholds for
    cuckoo hashing for d > 2 choices (and more than
    one element per bucket).
  • Stashes: where to use them, and a general
    framework for them?
  • Randomness: how much is really needed in the
    worst case? On suitably random data?
  • Parallelizable instantiations of cuckoo hashing?
  • Real-world applications for cuckoo hashing.
  • Your question here

57
Thanks
Much thanks to Martin Dietzfelbinger and Rasmus
Pagh for comments, suggestions,
references. Thanks to my co-authors for the
results.
58
THANK YOU.