Title: Some Open Questions Related to Cuckoo Hashing
1 Some Open Questions Related to Cuckoo Hashing
- Michael Mitzenmacher
- Harvard University
2 The Beginnings
3 Cuckoo Hashing
- Basic scheme: each element gets two possible locations (uniformly at random).
- To insert x, check both locations for x. If one is empty, insert.
- If both are full, x kicks out an old element y. Then y moves to its other location.
- If that location is full, y kicks out z, and so on, until an empty slot is found (see the sketch below).
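A minimal sketch of this insertion procedure in Python (illustrative only; the one-element-per-bucket layout, the MAX_LOOP cutoff, and returning the unplaced element on failure are choices made here, not details from the talk):

```python
MAX_LOOP = 100  # give up after this many displacements (cutoff value is an assumption)

class CuckooTable:
    """Cuckoo hash table: one element per bucket, two candidate buckets per element."""

    def __init__(self, num_buckets, hash_fns):
        self.table = [None] * num_buckets
        self.hash_fns = hash_fns  # e.g. (h1, h2), each mapping keys to bucket indices

    def lookup(self, x):
        # Worst-case constant time: x can only live in its two candidate buckets.
        return any(self.table[h(x)] == x for h in self.hash_fns)

    def insert(self, x):
        """Return None on success, or a still-unplaced element if the insert fails."""
        for h in self.hash_fns:
            if self.table[h(x)] is None:       # an empty candidate bucket: use it
                self.table[h(x)] = x
                return None
        pos = self.hash_fns[0](x)
        for _ in range(MAX_LOOP):
            x, self.table[pos] = self.table[pos], x   # x kicks out the occupant y
            if x is None:
                return None                            # the carried element found an empty slot
            h1, h2 = (h(x) for h in self.hash_fns)
            pos = h2 if pos == h1 else h1              # y moves to its other location
        return x  # probably a cycle or a very long path: caller should rehash
```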
4 Cuckoo Hashing Examples
[Diagram: table state with elements A, B, C, D, E]
5 Cuckoo Hashing Examples
[Diagram: table state with elements A, B, C, D, E, F]
6 Cuckoo Hashing Examples
[Diagram: table state with elements A, B, C, D, E, F]
7 Cuckoo Hashing Examples
[Diagram: table state with elements A, B, C, D, E, F, G]
8 Cuckoo Hashing Examples
[Diagram: table state with elements A, B, C, D, E, F, G]
9 Cuckoo Hashing Examples
[Diagram: table state with elements A, B, C, D, E, F, G]
10 Why Do We Care About Cuckoo Hashing?
- Hash tables are a fundamental data structure.
- Multiple-choice hashing yields tables with
  - High memory utilization.
  - Constant time look-ups.
  - Simplicity: easily coded, parallelized.
- Cuckoo hashing expands on this, combining multiple choices with the ability to move elements.
- Practical potential, and theoretically interesting!
11 Good Properties of Cuckoo Hashing
- Worst-case constant lookup time.
- High memory utilizations possible.
- Simple to build and design.
12 Cuckoo Hashing Failures
- Bad case 1: inserted element runs into cycles.
- Bad case 2: inserted element has a very long path before insertion completes.
  - Could be on a long cycle.
- Bad cases occur with very small probability when the load is sufficiently low.
- Theoretical solution: re-hash everything if a failure occurs (see the sketch below).
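Building on the sketch above, the "re-hash everything" fallback might look like the following (again only a sketch; `fresh_hash_fns` is a hypothetical callable that draws a new pair of random hash functions):

```python
def rehash_all(num_buckets, elements, fresh_hash_fns):
    """Rebuild a table from scratch, retrying with new hash functions until everything fits."""
    while True:
        table = CuckooTable(num_buckets, fresh_hash_fns())
        if all(table.insert(y) is None for y in elements):
            return table

def insert_with_rehash(table, x, fresh_hash_fns):
    """Insert x; if the insertion fails, re-hash the entire table with new hash functions."""
    leftover = table.insert(x)
    if leftover is None:
        return table
    elements = [y for y in table.table if y is not None] + [leftover]
    return rehash_all(len(table.table), elements, fresh_hash_fns)
```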
13 Various Representations
[Diagrams: alternative ways of drawing the buckets and elements]
14 Basic Performance
- For 2 choices and load less than 50%, n elements gives a failure rate of Θ(1/n); maximum insert time is O(log n).
- Related to a random graph representation.
  - Each element is an edge; buckets are vertices.
  - An edge corresponds to the two random choices of an element.
  - Small load implies small acyclic or unicyclic components, of size at most O(log n).
15 Natural Extensions
- More than 2 choices per element.
  - Very different: hypergraphs instead of graphs.
  - D. Fotakis, R. Pagh, P. Sanders, and P. Spirakis. Space efficient hash tables with worst case constant access time.
- More than 1 element per bucket.
  - M. Dietzfelbinger and C. Weidling. Balanced allocation and dictionaries with tightly packed constant size bins.
16 Variations
- Online: elements are inserted and deleted as you go.
  - Constant expected time; logarithmic (or polylogarithmic) time with high probability per element.
- Offline: all elements are available at the start.
  - Becomes a maximum matching problem (see the sketch below).
  - No real moving of elements -- equivalent to the offline version of the multiple-choice hashing of Azar, Broder, Karlin, and Upfal.
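A small illustration of that offline view (a sketch under the stated assumptions, not code from the talk): every element can be placed exactly when the bipartite graph of elements versus their candidate buckets has a matching saturating all elements, found here with a simple augmenting-path search.

```python
def offline_cuckoo_feasible(choices):
    """choices: dict mapping each element to the list of buckets it may use.
    Returns True iff every element can be assigned one of its candidate buckets,
    i.e. the element-bucket bipartite graph has a matching saturating all elements."""
    owner = {}  # bucket -> element currently assigned to it

    def try_assign(x, visited):
        for b in choices[x]:
            if b in visited:
                continue
            visited.add(b)
            # Take bucket b if it is free, or if its current owner can be re-assigned.
            if b not in owner or try_assign(owner[b], visited):
                owner[b] = x
                return True
        return False

    return all(try_assign(x, set()) for x in choices)

# Example: three elements with two choices each over buckets {0, 1, 2}.
print(offline_cuckoo_feasible({"a": [0, 1], "b": [0, 1], "c": [1, 2]}))  # True
```

With more choices per element or more slots per bucket, the same offline question becomes the orientability problem discussed later in the talk.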
17 Open Question 1: Random Walk Cuckoo Hashing
- More than 2 choices is important.
  - Much higher memory utilizations.
  - 3 choices: 90% in experiments.
  - 4 choices: about 97%.
- Analysis of FPSS: use breadth-first search on the bipartite graph to find an augmenting path.
  - Not practical for many implementations.
18 Random Walk Cuckoo Hashing
- When it is time to kick something out, choose one randomly (see the sketch below).
  - Small state, effective.
- Intuition: if a fraction p of the buckets are empty, the random walk should have probability p of finding an empty bucket at each step.
  - Clearly wrong, but a nice intuition.
  - Suggests logarithmic time to find an empty slot.
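A sketch of the random-walk insertion step with d choices (the cutoff and the exact eviction rule here are assumptions; variants that avoid evicting back into the bucket just vacated are also common):

```python
import random

def random_walk_insert(table, hash_fns, x, max_steps=1000):
    """table: a list of buckets holding one element each (None if empty).
    hash_fns: d hash functions giving x's candidate buckets.
    Returns the number of displacements on success, or None on failure."""
    for step in range(max_steps):
        positions = [h(x) for h in hash_fns]
        for pos in positions:
            if table[pos] is None:          # any empty candidate bucket: done
                table[pos] = x
                return step
        pos = random.choice(positions)      # all full: evict a uniformly random candidate
        table[pos], x = x, table[pos]       # x takes the slot; continue with the evictee
    return None
```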
19 The Open Question
- Find tight bounds on the performance of random walk cuckoo hashing in the online setting, for d ≥ 3 choices (and possibly more than one element per bucket).
20 Recent Progress
- Polylogarithmic bounds on insertion time for a large number of choices [RANDOM '09; Frieze, Mitzenmacher, Melsted].
- Two-step argument:
  - Most buckets have an augmenting path of length O(log log n) to an empty bucket. (Reach an empty bucket with inverse-logarithmic probability.)
  - Expansion gives that such a bucket is found after O(log n) steps with high probability.
21 The Open Question
- Find tight bounds on the performance of random walk cuckoo hashing in the online setting, for d ≥ 3 choices (and possibly more than one element per bucket).
- Is logarithmic insertion time the right answer? Lower bounds?
- Better understanding of graph structure with 3 or more choices.
22 Open Question 2: Thresholds
- How much load can cuckoo hashing handle before collisions overwhelm it?
- There appear to be asymptotic thresholds.
  - Fine below the threshold, disaster after.
  - Useful for designs for real systems.
- The case of 2 choices, 1 element per bucket is well understood.
  - Less so for other cases.
23 The Open Question
- Tight thresholds for cuckoo hashing schemes, and corresponding efficient algorithms.
24 What's Known
- 2 choices, 1 element per bucket: well understood.
- For 2 choices, more than 1 element per bucket:
  - Corresponds to orientability problems on random graphs: orient edges so that no more than k point to each vertex.
  - Offline thresholds known.
  - Online (provable) thresholds weak.
- For more than 2 choices:
  - Harder orientability problems.
  - Online (provable) thresholds weak.
  - Very close lower/upper bounds for the offline setting.
25 New Result
- Dietzfelbinger, Goerdt, Mitzenmacher, Montanari, Pagh have tight bounds on offline thresholds for more than 2 choices, 1 item per bucket.
- Extension to more than 1 item per bucket still open.
- Writeup (hopefully) coming soon.
26 What Was Known (Example)
- Case of 3 choices.
- Upper bound on load of 0.9183 [Batu, Berenbrink, Cooper].
  - Uses a differential-equation based analysis of the orientability threshold.
- Lower bound of 0.8894 (offline) [Dietzfelbinger, Pagh].
  - Random maximum matching problem.
  - Use random matrices with 3 ones per column to design dictionary schemes. The bound corresponds to the full-rank threshold of such matrices.
- The upper bound is tight, using a better bound on the full-rank threshold.
27 The Open Question
- Tight thresholds for cuckoo hashing schemes, and corresponding efficient algorithms.
  - Offline bounds for more than 2 choices.
  - Offline bounds for more than 2 choices and more than 1 item per bucket.
  - Online bounds generally.
- The specific case of d = 3 is especially interesting.
28 Open Question 3: Stashes
- A failure is declared whenever one element can't be placed.
  - Is that really necessary?
  - What if we could keep one element unplaced? Or eight? Or O(log n)?
- Goal: reduce the failure probability.
- Second goal: reduce moves per insert.
29 The Open Question
- What is the value of some extra space to stash problematic elements?
30 Motivation: CAMs
- CAM = content-addressable memory.
  - Fully associative lookup.
  - Usually expensive, so must be kept small.
  - A hardware solution, or a dedicated cache line in software.
- Not usually considered in theoretical work, but very useful in practice.
  - Can we bridge this gap?
  - What can CAMs do for us?
31 A CAM-Stash
- Use a CAM to stash away elements that would cause failure (see the sketch below).
  - ESA 2008, Kirsch, Mitzenmacher, Wieder.
- Intuition: if failures were independent, the probability that s elements cause failures goes to Θ(1/n^s).
  - Failures are not independent, but nearly so.
- A stash holding a constant number of elements greatly reduces the failure probability.
- Implemented as a hardware CAM or cache line.
  - Lookup requires also looking at the stash.
  - But it is generally empty.
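A sketch of the stash layered on the earlier two-choice table (the stash capacity and the CuckooTable interface come from the sketches above, not from the ESA 2008 paper):

```python
class StashedCuckooTable(CuckooTable):
    """Cuckoo table plus a small stash for elements that would otherwise cause a failure."""

    def __init__(self, num_buckets, hash_fns, stash_capacity=4):
        super().__init__(num_buckets, hash_fns)
        self.stash_capacity = stash_capacity
        self.stash = []  # stands in for a small CAM or a dedicated cache line

    def lookup(self, x):
        # The stash is usually empty, so this extra check is cheap.
        return super().lookup(x) or x in self.stash

    def insert(self, x):
        leftover = super().insert(x)
        if leftover is None:
            return True
        if len(self.stash) < self.stash_capacity:
            self.stash.append(leftover)  # park the element that could not be placed
            return True
        return False  # stash is also full: declare failure (or rehash everything)
```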
32 Analysis Method
- Treat cells as vertices, elements as edges in a bipartite graph.
- Count components that have excess edges; those excess edges must be placed in the stash (see the sketch below).
- Random graph analysis to bound the excess edges.
[Figure: a component with 6 vertices and 7 edges; 1 edge must go into the stash.]
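A sketch of that count under the two-choice, one-element-per-bucket view from earlier (each element contributes one edge between its two candidate buckets; the union-find bookkeeping is just one convenient way to find the components):

```python
from collections import Counter

def stash_size_needed(num_buckets, edges):
    """edges: one (h1(x), h2(x)) pair per element.
    Each component can hold at most as many elements as it has vertices,
    so its surplus of edges over vertices must go into the stash."""
    parent = list(range(num_buckets))

    def find(v):  # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    edges_in_comp = Counter(find(a) for a, _ in edges)
    vertices_in_comp = Counter(find(v) for v in range(num_buckets))
    return sum(max(0, e - vertices_in_comp[root]) for root, e in edges_in_comp.items())
```

On the figure's example, a component with 7 edges over 6 vertices has a surplus of 1, matching the single stashed edge.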
33 A Simple Experiment
- 10,000 elements, table of size 24,000, 2 choices per element, 10^7 trials. (A driver sketch for a scaled-down version follows the table.)

  Stash Size Needed   Trials
  0                   9989861
  1                   10040
  2                   97
  3                   2
  4                   0
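A scaled-down version of this experiment could be reproduced with the counting function above (a sketch only; the two choices here are idealized, fully random buckets, and the parameters are smaller than the slide's):

```python
import random
from collections import Counter

def stash_histogram(n=1000, m=2400, trials=10000):
    """For each trial, give n elements two random choices among m buckets
    and record how large a stash that trial would have needed."""
    hist = Counter()
    for _ in range(trials):
        edges = [(random.randrange(m), random.randrange(m)) for _ in range(n)]
        hist[stash_size_needed(m, edges)] += 1
    return hist

print(stash_histogram())
```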
34 Generalizations
- Can similarly generalize known results for cuckoo hashing with more than 2 choices, more than 1 element per bucket.
- A stash of size s reduces the failure exponent linearly in s.
- Intuition: the random graph analysis exposes a bottleneck in cuckoo hashing. Stashes relieve the bottleneck.
35 CAM to Improve Insertion Time
- Lots of moves per insert in the worst case.
  - The average is constant.
  - But the maximum is Ω(log n) with non-trivial (inverse-polynomial) probability.
- May want a bounded number of memory accesses per insert.
- Empirical study by Kirsch/Mitzenmacher.
36 A CAM-Queue
- Insertion is a sequence of suboperations.
  - Of the form: move x to position H_j(x).
- Use the CAM as a queue for pending suboperations (see the sketch below).
  - Perform suboperations from the queue as available.
  - Move attempt: 1 lookup/write.
  - A suboperation may cause another suboperation to go on the queue.
- Lookup: check the hash table and the CAM-queue.
- De-amortization:
  - Use the queue to turn worst-case performance into average-case performance.
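A sketch of the queue idea for the two-choice case (the per-insert work budget and the exact scheduling policy here are assumptions, not the precise scheme from the empirical study):

```python
from collections import deque

class QueuedCuckooTable:
    """De-amortized two-choice cuckoo hashing: pending 'move x to H_j(x)' suboperations
    wait in a queue (standing in for a small CAM); each insert does a bounded amount of work."""

    def __init__(self, num_buckets, hash_fns, ops_per_insert=4):
        self.table = [None] * num_buckets
        self.hash_fns = hash_fns              # (h0, h1)
        self.queue = deque()                  # pending (element, hash index) suboperations
        self.ops_per_insert = ops_per_insert  # assumed per-insert work budget

    def lookup(self, x):
        # Check the table and the queue of still-pending elements.
        return any(self.table[h(x)] == x for h in self.hash_fns) or \
               any(y == x for y, _ in self.queue)

    def insert(self, x):
        self.queue.append((x, 0))             # suboperation: move x to H_0(x)
        for _ in range(self.ops_per_insert):  # bounded number of suboperations per insert
            if not self.queue:
                return
            y, j = self.queue.popleft()
            pos = self.hash_fns[j](y)
            z, self.table[pos] = self.table[pos], y    # place y, possibly evicting z
            if z is not None:
                other = 1 if self.hash_fns[0](z) == pos else 0
                self.queue.append((z, other))          # z must move to its other bucket
```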
37 Queue Sizes
- Need the CAM sized so that it overflows with negligible probability.
- Maximum queue size is much bigger than the average.
- Experiments suggest queues of size in the small tens are possible, with 4 suboperations per insert, in practice.
- Recent work by Arbitman, Naor, Segev gives provable bounds for a logarithmic-sized queue for 2-choice cuckoo hashing, up to 50% loads.
  - Analysis open for more than 2 choices.
38 The Open Question
- What is the value of some extra space to stash problematic elements?
- Can these uses of stashes be similarly useful for other data structures?
- Is there a general theory telling us the value of constant/logarithmic/linear sized stashes?
39 Open Question 4: Randomness
- Analysis is always easier when assuming hash functions are perfectly random.
- But perfect hash functions are unrealistic.
- What about real hash functions on real data?
40 The Open Question
- How much randomness is needed for cuckoo hashing to be effective?
41 Universal Hash Families
- Defined by Carter/Wegman. (A standard example construction appears below.)
- A family of hash functions H : [N] -> [M] is k-wise independent if, when H is chosen randomly from the family, for any distinct x_1, x_2, ..., x_k and any a_1, a_2, ..., a_k,
  Pr[H(x_1) = a_1, ..., H(x_k) = a_k] = 1/M^k.
- The family is k-wise universal if, for any distinct x_1, x_2, ..., x_k,
  Pr[H(x_1) = H(x_2) = ... = H(x_k)] <= 1/M^(k-1).
42 Recent Results
- For 2 choices, O(log n)-wise independence is sufficient; [PR] show that hash functions of Siegel suffice.
- The queueing result of [ANS] uses new technology of Braverman to show that polylogarithmic-wise independence suffices.
- Cohen and Kane show that 5-wise independence is not enough; they also show that one O(log n)-wise independent and one pairwise independent hash function suffice.
43 Another Approach: Random Data
- Previous analysis is for worst-case data. What about random data?
- Analysis is usually trivial if the data is independently, uniformly chosen over a large universe.
  - Then all hashes appear perfectly random.
- Not a good model for real data.
- Need an intermediate model between worst case and average case [Mitzenmacher, Vadhan].
44 A Model for Data
- Based on models of semi-random sources.
- Data is a finite stream, modeled by a sequence of random variables X_1, X_2, ..., X_T.
- The range of each variable is [N].
- Each stream element has some entropy, conditioned on the values of previous elements.
  - Correlations are possible.
  - But each new element has some unpredictability.
45 Applications
- Potentially, wherever hashing is used:
  - Bloom Filters
  - Power of Two Choices
  - Linear Probing
  - Cuckoo Hashing
  - Many Others
46 Intuition
- If each element has entropy, then extract the entropy to hash each element to a near-uniform location.
- Extractors should provide near-uniform behavior.
47 Notions of Entropy
- Max probability: p = max_x Pr[X = x].
  - Min-entropy: H_min(X) = log(1/p).
  - Block source with max probability p per block.
- Collision probability: cp(X) = sum_x Pr[X = x]^2.
  - Renyi entropy: H_2(X) = log(1/cp(X)).
  - Block source with collision probability p per block.
- These entropies are within a factor of 2 of each other.
- We use collision probability/Renyi entropy. (A small worked example follows below.)
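A small worked check of these quantities on a toy distribution (purely illustrative):

```python
import math

def entropy_measures(probs):
    """probs: a finite distribution given as a list of probabilities summing to 1."""
    max_p = max(probs)
    cp = sum(q * q for q in probs)              # collision probability
    return {
        "max probability": max_p,
        "min-entropy": -math.log2(max_p),       # log(1 / max probability)
        "collision probability": cp,
        "Renyi entropy": -math.log2(cp),        # log(1 / collision probability)
    }

# A skewed source over 5 values: min-entropy 1 bit, Renyi entropy about 1.68 bits,
# and the two are indeed within a factor of 2 of each other.
print(entropy_measures([0.5, 0.125, 0.125, 0.125, 0.125]))
```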
48 Leftover Hash Lemma
- A classical result (from 1989) [ILL].
- Intuitive statement: if H is chosen from a pairwise independent hash family, and X is a random variable with small collision probability, then H(X) will be close to uniform.
49 Leftover Hash Lemma
- Specific statements for the current setting, for 2-universal hash families. (A worked parameter example follows below.)
- Let H : [N] -> [M] be a random hash function from a 2-universal hash family. If cp(X) < 1/K, then (H, H(X)) is (1/2)√(M/K)-close to (H, U_[M]).
  - Equivalently, if X has Renyi entropy at least log M + 2 log(1/ε), then (H, H(X)) is ε-close to uniform.
- Let H : [N] -> [M] be a random hash function from a 2-universal hash family. Given a block source with collision probability at most 1/K per block, (H, H(X_1), ..., H(X_T)) is (T/2)√(M/K)-close to (H, U_[M]^T).
  - Equivalently, if each block has Renyi entropy at least log M + 2 log(T/ε), then (H, H(X_1), ..., H(X_T)) is ε-close to uniform.
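As a worked reading of the last equivalence (numbers chosen purely for illustration): with M = 2^20 buckets per hash function, a stream of T = 2^16 elements, and target distance ε = 2^-10, each block needs Renyi entropy at least log M + 2 log(T/ε) = 20 + 2(16 + 10) = 72 bits.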
50 Further Improvements
- Additional improvements over the Leftover Hash Lemma in the paper [MV08].
- Chung and Vadhan [CV08] further improve the analysis.
- Dietzfelbinger and Schellbach show you have to be careful: pairwise independence is not enough even for random data sets from a small universe [DS09].
  - Not enough entropy when the data is large compared to the universe.
51 The Open Question
- How much randomness is needed for cuckoo hashing to be effective?
- Tighten the bound on independence needed in the worst case, and provide efficient hash function families.
- What better results are possible with reasonable assumptions on the data?
52 Open Question 5: Parallel Architectures
- Multicores, Graphics Processing Units (GPUs), and other parallel architectures are possibly the next wave.
- Multiple-choice hashing and cuckoo hashing seem naturally parallelizable.
- Theory and practice?
53 The Open Question
- Design and analyze efficient schemes for constructing and maintaining hash tables in modern parallel architectures.
54 Related Work
- Plenty on parallel hashing/load balancing schemes.
  - PRAM emulation and related work in the 1990s.
- Technical improvements of the last decade suggest more is possible.
- In work with Amenta et al., we designed a new implementation for GPUs based on cuckoo hashing.
  - To appear at SIGGRAPH '09.
- New theory, practical implementations possible?
55 The Open Question
- Design and analyze efficient schemes for constructing and maintaining hash tables in modern parallel architectures.
- How can cuckoo hashing be helpful?
- Practical implementations, with strong theoretical backing?
56 Open Questions
- Tight bounds on insertion times for random walk cuckoo hashing for d > 2 choices.
- Tight bounds on load capacity thresholds for cuckoo hashing for d > 2 choices (and more than one element per bucket).
- Stashes: where to use them, and a general framework for them?
- Randomness: how much is really needed in the worst case? On suitably random data?
- Parallelizable instantiations of cuckoo hashing?
- Real-world applications for cuckoo hashing.
- Your question here.
57 Thanks
Many thanks to Martin Dietzfelbinger and Rasmus Pagh for comments, suggestions, and references. Thanks to my co-authors for the results.
58 THANK YOU.