Title: Longest Prefix Matching: Trie-based Techniques
1. Longest Prefix Matching: Trie-based Techniques
- CS 685: Network Algorithmics
- Spring 2006
2. The Problem
- Given:
  - A database of prefixes with associated next hops, e.g.
    - 1000101 → 128.44.2.3
    - 01101100 → 4.33.2.1
    - 10001 → 124.33.55.12
    - 10 → 151.63.10.111
    - 01 → 4.33.2.1
    - 1000100101 → 128.44.2.3
  - A destination IP address, e.g. 120.16.8.211
- Find the longest matching prefix and its next hop
3. Constraints
- Handle 150,000 prefixes in the database
- Complete a lookup in one minimum-sized (40-byte) packet transmission time: at OC-768 (40 Gbps), that is 8 nsec
- High degree of multiplexing: packets from 250,000 flows interleaved
- Database updated every few milliseconds
- ⇒ performance ∝ number of memory accesses
4. Basic ("Unibit") Trie Approach
- Recursive data structure (a tree)
- Nodes represent prefixes in the database
- Root corresponds to the prefix of length zero
- Node for prefix x has three fields:
  - 0-branch: pointer to the node for prefix x0 (if present)
  - 1-branch: pointer to the node for prefix x1 (if present)
  - Next-hop info for x (if present)

Example database:
  a: 0 → x      b: 01000 → y   c: 011 → z
  d: 1 → w      e: 100 → u     f: 1100 → z
  g: 1101 → u   h: 1110 → z    i: 1111 → x
5. [Figure: unibit trie for the example database (a: 0 → x, b: 01000 → y, c: 011 → z, d: 1 → w, e: 100 → u, f: 1100 → z, g: 1101 → u, h: 1110 → z, i: 1111 → x); edges labeled 0 and 1]
6. Trie Search Algorithm

    typedef struct foo {
        struct foo *trie_0, *trie_1;   /* 0-branch and 1-branch */
        NEXTHOPINFO trie_info;         /* next-hop info for this prefix, if any */
    } TRIENODE;

    NEXTHOPINFO best = NULL;
    TRIENODE *np = root;
    unsigned int bit = 0x80000000;

    while (np != NULL) {
        if (np->trie_info) best = np->trie_info;
        /* check next bit */
        if (addr & bit) np = np->trie_1;
        else            np = np->trie_0;
        bit >>= 1;
    }
    return best;
7. Conserving Space
- Sparse database ⇒ wasted space
  - Long chains of trie nodes with only one non-NULL pointer
- Solution: handle "one-way" branches with special nodes (sketched below)
  - Encode the bits corresponding to the missing nodes using text strings
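As a concrete illustration, here is one way such a special node might look in C; the `skip` field and its handling are an assumption about how the slide's "text strings" could be encoded, not the lecture's exact layout:

    /* Path-compressed trie node: a minimal sketch.  During search, the
     * stored skip bits must match the next address bits before branching. */
    typedef struct cnode {
        char         *skip;       /* bits of a one-way chain, e.g. "00"; "" if none */
        struct cnode *child[2];   /* 0-branch and 1-branch */
        NEXTHOPINFO   info;       /* next-hop info for this prefix, if any */
    } CNODE;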
8. [Figure: the unibit trie from slide 5, highlighting the one-way branch chain leading to b: 01000 → y]
9. [Figure: the same trie with the one-way chain collapsed into a single node carrying the text string "00"]
10. Bigger Issue: Slow!
- Computing one bit at a time is too slow
- Worst case: one memory access per bit (32 accesses!)
- Solution: compute n bits at a time
  - n = stride length
  - Use n-bit chunks of the address as an index into an array in each trie node
- How to handle prefixes whose length is not a multiple of n?
  - Extend them, replicating entries as needed (see the sketch below)
  - E.g. for n = 3, prefix 1 becomes 100, 101, 110, 111
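A minimal C sketch of this expansion, assuming the prefix bits arrive right-aligned in an unsigned int and a caller-supplied emit callback collects the results (both are illustrative choices, not the lecture's interface):

    /* Expand a prefix of length len to the next multiple of stride n,
     * emitting each filled-in prefix of length newlen. */
    void expand_prefix(unsigned value, int len, int n,
                       void (*emit)(unsigned value, int newlen)) {
        int newlen = ((len + n - 1) / n) * n;   /* round length up to stride multiple */
        int extra  = newlen - len;              /* bits to fill in */
        for (unsigned fill = 0; fill < (1u << extra); fill++)
            emit((value << extra) | fill, newlen);  /* e.g. 1 -> 100,101,110,111 for n=3 */
    }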
11. Extending Prefixes (example stride length 2)

Original database:        Expanded database:
  a: 0 → x                  a0: 00 → x      a1: 01 → x
  b: 01000 → y              b0: 010000 → y  b1: 010001 → y
  c: 011 → z                c0: 0110 → z    c1: 0111 → z
  d: 1 → w                  d0: 10 → w      d1: 11 → w
  e: 100 → u                e0: 1000 → u    e1: 1001 → u
  f: 1100 → z               f: 1100 → z
  g: 1101 → u               g: 1101 → u
  h: 1110 → z               h: 1110 → z
  i: 1111 → x               i: 1111 → x
12. [Figure: stride-2 multibit trie for the expanded database of slide 11]
Total cost: 40 pointers (22 null); max memory accesses: 3
13. [Figure: the unibit trie for the same database, for comparison]
Total cost: 46 pointers (21 null); max memory accesses: 5
14. Choosing Fixed Stride Lengths
- We are trading space for time
  - Larger stride length ⇒ fewer memory accesses
  - Larger stride length ⇒ more wasted space
- Use the largest stride length that will fit in memory and complete the required accesses within the time budget
15. Updating
- Insertion:
  - Keep a unibit version of the trie, with each node labeled with its longest matching prefix and that prefix's length
  - To insert P, search for P, remembering the last node visited, until:
    - a null pointer is reached (P not present), or
    - the last stride in P is reached
  - Expand P as needed to match the stride length
  - Overwrite any existing entries whose prefix length is less than P's (see the sketch below)
- Deletion is similar:
  - Find the entry for the prefix to be deleted
  - Remove its entry (from the unibit copy also!)
  - Expand any entries that were "covered" by the deleted prefix
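A sketch of the overwrite step for one expanded node, assuming each array entry records the length of the prefix that filled it; the ENTRY layout and parameter names are hypothetical:

    typedef struct {
        NEXTHOPINFO info;
        int         pfx_len;   /* length of the prefix stored here; 0 = empty */
    } ENTRY;

    /* Insert prefix pbits (plen bits, plen <= n after earlier strides)
     * into one node of stride n, overwriting only shorter prefixes. */
    void insert_into_node(ENTRY node[], int n, unsigned pbits, int plen,
                          NEXTHOPINFO info) {
        int extra = n - plen;                    /* bits to expand */
        for (unsigned fill = 0; fill < (1u << extra); fill++) {
            ENTRY *e = &node[(pbits << extra) | fill];
            if (e->pfx_len <= plen) {            /* don't clobber longer prefixes */
                e->info    = info;
                e->pfx_len = plen;
            }
        }
    }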
16. Variable Stride Lengths
- Every node need not have the same stride length
- Reduce waste by allowing the stride length to vary per node
  - The actual stride length is encoded in the pointer to the trie node
- Nodes with fewer used pointers can have smaller stride lengths
17. [Figure: variable-stride trie for the database below; node strides of 1 bit, 2 bits, 2 bits, and 1 bit]

Expanded database:
  a0: 00 → x     a1: 01 → x     b: 01000 → y
  c0: 0110 → z   c1: 0111 → z   d0: 10 → w
  d1: 11 → w     e: 100 → u     f: 1100 → z
  g: 1101 → u    h: 1110 → z    i: 1111 → x

Total waste: 16 pointers; max memory accesses: 3
Note: encoding the stride length costs 2 bits per pointer
18. Calculating Stride Lengths
- How to pick stride lengths?
- We have two variables to play with: height and stride length
  - Trie height determines lookup speed ⇒ set the maximum height first; call it h
  - Then choose strides to minimize storage
- Define the cost of a trie T, C(T):
  - If T is a single node: the number of array locations in the node
  - Else: the number of array locations in the root + Σi C(Ti), where the Ti are the children of T
- Straightforward recursive solution (sketched below):
  - A root stride of s results in y = 2^s subtries T1, ..., Ty
  - For each possible s, recursively compute the optimal strides for the C(Ti) using height limit h − 1
  - Choose the root stride s that minimizes the total cost 2^s + Σi C(Ti)
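A pseudocode-style C sketch of this recursion; UNIBIT, remaining_depth() (the longest path below t) and subtrie(t, i, s) (the subtrie reached by the s-bit path i) are hypothetical helpers standing in for details the slide leaves out:

    #include <limits.h>

    int optimal_cost(UNIBIT *t, int h) {
        if (t == NULL) return 0;
        int d = remaining_depth(t);
        if (d == 0) return 1;                 /* a single entry suffices */
        if (h == 1) return 1 << d;            /* one node must cover everything left */
        int best = INT_MAX;
        for (int s = 1; s <= d; s++) {        /* try each root stride */
            int cost = 1 << s;                /* 2^s slots in the root */
            for (unsigned i = 0; i < (1u << s); i++)
                cost += optimal_cost(subtrie(t, i, s), h - 1);
            if (cost < best) best = cost;
        }
        return best;
    }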
19. Calculating Stride Lengths
- Problem: expensive, with repeated subproblems
- Solution (Srinivasan & Varghese): dynamic programming
  - Observe that each subtree of a variable-stride trie contains the same set of prefixes as some subtree of the original unibit trie
  - For each node of the unibit trie, compute the optimal stride and the cost of using it
  - Start at the bottom (height 1) and work up:
    - Determine the optimal grouping of leaves in each subtree
    - Given the subtrees' optimal costs, compute the parent's optimal cost
- This yields the optimal stride-length selections for the given set of prefixes
20. [Figure: the example unibit trie, used for the dynamic-programming computation]
21. Alternative Method: Level Compression
- An LC-trie (Nilsson & Karlsson '98) is a variable-stride trie with no empty entries in trie nodes
- Procedure:
  - Select the largest root stride that allows no empty entries
  - Recurse down through the tree
- Disadvantage: cannot control the height precisely
22. [Figure: LC-trie for the example database; node strides of 1, 1, 1, and 2 bits]
23. Performance Comparisons
- MAE-East database (1997 snapshot): 40K prefixes
- "Unoptimized" multibit trie: 2003 KB
- Optimal fixed-stride: 737 KB, computed in 1 msec
  - Height limit 4 (≈ 1 Gbps wire speed @ 80 nsec/access)
- Optimized (S&V) variable-stride: 423 KB, computed in 1.6 sec
  - Height limit 4
- LC-compressed: 700 KB, height 7
24. Lulea Compressed Tries
- Goals:
  - Minimize the number of memory accesses
  - Aggressively compress the trie so it can fit in SRAM (or even cache)
- Three-level trie with strides of 16, 8, 8
  - 8 memory accesses typical
- Main techniques:
  - Leaf-pushing
  - Eliminating duplicate pointers from trie-node arrays
  - Efficient bit-counting using precomputation for large bitmaps
  - Using indices instead of full pointers for next-hop info
25. Technique 1: Leaf-Pushing
- In general, a trie-node entry has associated with it:
  - a pointer to a next trie node,
  - a prefix (i.e. a pointer to next-hop info),
  - or both, or neither
- Observation: we don't need a prefix pointer along the way until we reach a leaf
- So "push" prefix pointers down to the leaves, keeping only one set of pointers per node (sketched below)
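A sketch of leaf-pushing, shown on the unibit trie of slide 6 for simplicity (the real scheme pushes within multibit nodes); note that a new leaf must be created wherever a branch was NULL, so the pushed answer has somewhere to live:

    #include <stdlib.h>

    /* Push the best prefix seen so far down to the leaves.
     * Call as: leaf_push(&root, NULL); */
    void leaf_push(TRIENODE **npp, NEXTHOPINFO inherited) {
        TRIENODE *np = *npp;
        if (np == NULL) {
            if (inherited == NULL) return;
            np = *npp = calloc(1, sizeof *np);   /* new leaf for the pushed prefix */
            np->trie_info = inherited;
            return;
        }
        if (np->trie_info) inherited = np->trie_info;  /* longer prefix wins */
        if (np->trie_0 || np->trie_1) {
            np->trie_info = NULL;                /* interior node: push it down */
            leaf_push(&np->trie_0, inherited);
            leaf_push(&np->trie_1, inherited);
        } else {
            np->trie_info = inherited;           /* leaf keeps the final answer */
        }
    }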
26. Leaf-Pushing: the Concept
[Figure: prefix pointers pushed from interior nodes down to the leaves]
27. Before Leaf-Pushing
Expanded database:
  a0: 00 → x      a1: 01 → x      b0: 010000 → y   b1: 010001 → y
  c0: 0110 → z    c1: 0111 → z    d0: 10 → w       d1: 11 → w
  e0: 1000 → u    e1: 1001 → u    f: 1100 → z      g: 1101 → u
  h: 1110 → z     i: 1111 → x
Cost: 40 pointers (22 wasted)
28. Technique 2: Removing Duplicate Pointers
- Leaf-pushing results in many consecutive duplicate pointers
- We would like to remove the redundancy and store only one copy in each node
- Problem: now we can't directly index into the array using address bits
- Example: for k = 2, bits 01 (index 1) must be converted to index 0 somehow
29. Technique 2: Removing Duplicate Pointers
- Solution: add a bitmap with one bit per original entry
  - 1 indicates a new value
  - 0 indicates a duplicate of the previous value
- To convert index i, count the 1s up to position i in the bitmap, and subtract 1 (see the C version below)
- Example (entries 00 → u, 01 → u, 10 → w, 11 → w; bitmap 1010; compressed array [u, w]):
  - old index 1 → new index 0
  - old index 2 → new index 1
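The conversion in C, using a byte-per-bit bitmap for clarity (a real implementation packs the bits, as the following slides show):

    #include <stdint.h>

    /* Count 1 bits up to and including position i, minus one. */
    static int compressed_index(const uint8_t bitmap[], int i) {
        int ones = 0;
        for (int k = 0; k <= i; k++)
            ones += bitmap[k];          /* bitmap[k] is 0 or 1 */
        return ones - 1;
    }

    /* Usage, for the example above: with bitmap {1,0,1,0},
     * compressed_index(bitmap, 1) == 0 and compressed_index(bitmap, 2) == 1. */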
30. Bitmap for Duplicate Elimination
[Figure: a leaf-pushed trie level with its prefixes and the duplicate-elimination bitmap 1000000000001000100001000000000000000000100000000001000000000000100010000001000000110000000000000000 00]
31. Technique 3: Efficient Bit-Counting
- The Lulea first-level 16-bit stride ⇒ a 64K-entry bitmap
- Impractical to count bits up to, say, entry 34578 on the fly!
- Solution: precompute (P2a)
  - Divide the bitmap into chunks (say, 64 bits each)
  - Store in an array B the cumulative number of 1 bits preceding each chunk
  - Compute the number of 1 bits up to bit k by:

    chunkNum   = k >> 6;
    posInChunk = k & 0x3f;   /* k mod 64 */
    numOnes    = B[chunkNum] + count1sInChunk(chunkNum, posInChunk) - 1;
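A runnable version of this computation, assuming the bitmap is stored MSB-first in an array of 64-bit words, B[c] holds the precomputed count of 1 bits in all chunks before chunk c, and the compiler provides GCC/Clang's __builtin_popcountll:

    #include <stdint.h>

    static int compressed_index(const uint64_t bitmap[], const int B[], int k) {
        int chunkNum   = k >> 6;                      /* which 64-bit chunk */
        int posInChunk = k & 0x3f;                    /* k mod 64 */
        uint64_t upto  = ~0ULL << (63 - posInChunk);  /* mask for bits 0..posInChunk */
        int numOnes    = B[chunkNum]
                       + __builtin_popcountll(bitmap[chunkNum] & upto);
        return numOnes - 1;                           /* count -> array index */
    }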
32. Bit-Counting Precomputation Example
- Index 35, chunk size 8 bits
- Bitmap (in 8-bit chunks): 10010100 00000000 01110000 00001000 00110000 00101010
- Precomputed counts (1 bits preceding chunks 2-6): 3, 3, 6, 7, 9
- Converted index: 7 + 2 − 1 = 8
  (7 ones precede chunk 35 >> 3 = 4; 2 ones in that chunk up to position 35 mod 8 = 3; minus 1)
- Cost: 2 memory accesses (maybe fewer)
33. Technique 4: Efficient Pointer Representation
- Observation: the number of distinct next-hop pointers is limited
  - Each corresponds to an immediate neighbor of the router
  - Most routers have at most a few dozen neighbors
  - In some cases a router might have a few hundred distinct next hops, even a thousand
- Apply P7: avoid unnecessary generality
  - Only a few bits (say 8-12) are needed to distinguish the actual next-hop possibilities
  - Store indices into a table of next-hop info
  - E.g., supporting up to 1024 next hops takes 10 bits
  - 40K prefixes ⇒ 40K pointers ⇒ 160 KB at 32 bits each, vs. 50 KB at 10 bits
34. Other Lulea Tricks
- The first level of the trie uses two levels of bit-counting arrays:
  - the first counts bits before each 64-bit chunk
  - the second counts bits within each 16-bit word inside a chunk
- Second- and third-level trie nodes are laid out differently depending on the number of pointers in them
  - Each node has 256 entries, categorized by number of pointers:
  - 1-8: "sparse": store 8 8-bit indices plus 8 16-bit pointers (24 B)
  - 9-64: "dense": like the first level, but with only one bit-counting array (only six bits of count needed)
  - 65-256: "very dense": like the first level, with two bit-counting arrays (4 64-bit chunks, 16 16-bit words)
35. Lulea Performance Results
- 1997 MAE-East database: 32K entries, 58K leaves, 56 distinct next hops
- Resulting trie size: 160 KB
- Build time: 99 msec
- Almost all lookups took < 100 clock cycles (333 MHz Pentium)
36. Tree Bitmap (Eatherton, Dittia & Varghese)
- Goal: storage and speed comparable to Lulea, plus fast insertion
- The main culprit in slow insertion is leaf-pushing, so get rid of leaf-pushing
  - Go back to storing node and prefix pointers explicitly
  - Use the same compression-bitmap trick on both lists
- Store next-hop information separately; retrieve it only at the end
  - Like leaf-pushing, only in the control plane!
- Use smaller strides to limit memory accesses to one per trie node (Lulea requires at least two)
37. Storing Prefixes Explicitly
- To avoid expansion/leaf-pushing, we have to store prefixes in the node explicitly
- There are 2^(k+1) − 1 possible prefixes of length at most k
- Store a list of (unique) next-hop pointers for the prefixes covered by the node
- Use the same bitmap/bit-counting technique as Lulea to find the pointer index (sketched below)
- Keep trie nodes small (stride 4 or less); exploit hardware (P5) to do prefix matching and bit counting
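One plausible encoding of the per-node prefix search, assuming prefixes of length L are laid out in the internal bitmap at offset 2^L − 1 (an assumption consistent with the 2^(k+1) − 1 count above, not necessarily the paper's exact layout); the returned slot then goes through the usual bit-counting conversion to index the pointer list:

    #include <stdint.h>

    /* Find the longest prefix stored inside one node of stride k.
     * `bits` holds the next k address bits; bitmap has one byte per
     * possible prefix for clarity.  Returns the slot, or -1 if none. */
    static int longest_internal_match(const uint8_t bitmap[], unsigned bits, int k) {
        for (int L = k; L >= 0; L--) {          /* try longest length first */
            unsigned p = bits >> (k - L);       /* leading L bits */
            int slot = (1 << L) - 1 + p;        /* length-major layout */
            if (bitmap[slot]) return slot;
        }
        return -1;                              /* no prefix stored here */
    }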
38. Example: Root Node, Stride 3
[Figure: tree-bitmap root node for the example database: an internal prefix bitmap over the possible prefixes of length ≤ 3 (set for a: 0 → x, d: 1 → w, c: 011 → z, e: 100 → u), an external bitmap over the 8 possible children 000-111 (set for 010, 110, and 111, the paths leading to b, f/g, and h/i), and the compressed next-hop pointer list [x, w, z, u]; arrows lead to the child nodes]
39. Tree Bitmap Results
- Insertions are as in simple multibit tries
  - An insertion may cause a complete revamp of a trie node, but that requires only one memory allocation
- Performance is comparable to Lulea, but insertion is much faster
40. A Different Lookup Paradigm
- Can we use binary search to do longest-prefix lookups?
- Observe that each prefix corresponds to a range of addresses
  - E.g. 204.198.76.0/24 covers the range 204.198.76.0 - 204.198.76.255
- Each prefix contributes two range endpoints
- N disjoint prefixes divide the entire space into 2N + 1 disjoint segments
- By sorting the range endpoints and comparing the address against them, we can determine the exact prefix match
41. Prefixes as Ranges
[Figure: the example prefixes drawn as nested ranges on the address line]
42. Binary Search on Ranges
- Store the 2N endpoints in sorted order
  - Including the endpoints of the full address range (for the zero-length prefix)
- Store two pointers with each entry:
  - a ">" entry: next-hop info for addresses strictly greater than that value
  - an "=" entry: next-hop info for addresses equal to that value
43. Example: 6-bit Addresses

Example database:        As ranges:
  a: 0 → x                 a: 000000-011111 → x
  b: 01000 → y             b: 010000-010001 → y
  c: 011 → z               c: 011000-011111 → z
  d: 1 → w                 d: 100000-111111 → w
  e: 100 → u               e: 100000-100111 → u
  f: 1100 → z              f: 110000-110011 → z
  g: 1101 → u              g: 110100-110111 → u
  h: 1110 → z              h: 111000-111011 → z
  i: 1111 → x              i: 111100-111111 → x
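A sketch of the resulting lookup, assuming the ENDPOINT layout below and a table whose first entry is address 0, so a floor entry always exists (names are illustrative):

    #include <stdint.h>

    typedef struct {
        uint32_t    value;
        NEXTHOPINFO eq_info;   /* "=" entry: used when addr == value */
        NEXTHOPINFO gt_info;   /* ">" entry: used when addr > value */
    } ENDPOINT;

    NEXTHOPINFO range_lookup(const ENDPOINT ep[], int n, uint32_t addr) {
        int lo = 0, hi = n - 1;          /* find the largest ep[i].value <= addr */
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (ep[mid].value <= addr) lo = mid;
            else                       hi = mid - 1;
        }
        return (ep[lo].value == addr) ? ep[lo].eq_info : ep[lo].gt_info;
    }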
44. Range Binary Search Results
- N prefixes can be searched in log2 N + 1 steps
  - Slow compared to multibit tries
- Insertion can also be expensive
- Memory-expensive: requires 2 full-size entries per prefix
  - 40K prefixes at 32-bit addresses: 320 KB, not counting next-hop info
- Advantage: no patent restrictions!
45. Binary Search on Prefix Lengths (Waldvogel, et al.)
- For same-length prefixes, a hash table gives fast comparisons
- But linear search over the prefix lengths is too expensive
- Can we do a faster (binary) search on prefix lengths?
- Challenge: how do we know whether to move "up" or "down" in length on a failed probe?
- Solution: include extra entries indicating the presence of a longer prefix that might match
  - These are called marker entries
  - Each marker entry also contains the best-matching prefix for that node
- When moving toward longer lengths because of a marker, remember its best-matching prefix, in case of a later failure (see the sketch below)
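A sketch of the search loop; hash_lookup() and the HASHENTRY layout are hypothetical stand-ins for the per-length hash tables:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        int         is_marker;  /* a longer matching prefix may exist */
        NEXTHOPINFO bmp;        /* best matching prefix at this entry */
    } HASHENTRY;

    /* Hypothetical: probe the table for length `len` with the first
     * `len` bits of addr; returns NULL if absent. */
    HASHENTRY *hash_lookup(int len, uint32_t addr);

    NEXTHOPINFO lookup_by_length(uint32_t addr, const int lens[], int nlens) {
        NEXTHOPINFO best = NULL;
        int lo = 0, hi = nlens - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            HASHENTRY *e = hash_lookup(lens[mid], addr);
            if (e == NULL) {
                hi = mid - 1;                   /* no match: try shorter lengths */
            } else {
                best = e->bmp;                  /* remember, in case longer fails */
                if (e->is_marker) lo = mid + 1; /* a longer match may exist */
                else break;                     /* real prefix, nothing longer */
            }
        }
        return best;
    }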
46. Example: Binary Search on Prefix Length
Prefix lengths present: 1, 3, 4, 5
Hash tables (M = marker entry; BMP = best matching prefix):
  length 1:  0 → BMP (a, x);  1 → BMP (d, w)
  length 3:  011 → (c, z);  100 → (e, u);  110 M → (d, w);  111 M → (d, w);  010 M → (a, x)
  length 4:  1100 → (f, z);  1101 → (g, u);  1110 → (h, z);  1111 → (i, x);  0100 M → (a, x)
  length 5:  01000 → (b, y)
Example: search for addresses 011000 and 101000
47. Binary Search on Prefix Length: Performance
- Worst-case number of hash-table accesses: 5
- However, most prefixes are 16 or 24 bits long
  - Arrange the hash tables so these are handled in one or two accesses
- This technique is very scalable to longer addresses (e.g. 128 bits for IPv6)
  - A unibit trie for IPv6 would take 128 accesses!
48. Memory Allocation for Compressed Schemes
- Problem: a compressed scheme (like Lulea) keeps trie nodes at minimal size
- If a node grows (changes size), it must be reallocated and copied over
- As we have discussed, memory allocators can perform very badly
  - Let M be the size of the largest possible request
  - No allocator can guarantee that more than a 1/log2 M fraction of memory will be used!
  - E.g. if M = 32, 20% is the maximum guaranteed utilization
- So router vendors cannot claim to support large databases
49. Memory Allocation for Compressed Schemes
- Solution: compaction
  - Copy memory from one location to another
- General-purpose OSes avoid compaction!
  - Reason: it is very hard to find and update all pointers into the moved region
- The good news:
  - Pointer usage is very constrained in IP lookup algorithms
  - Most lookup structures are trees ⇒ at most one pointer to any node
  - By storing a "parent" pointer, we can easily update pointers as needed (sketched below)
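A sketch of the resulting constant-time fix-up when a node is moved during compaction; the field names are illustrative:

    #include <string.h>

    /* Each node stores a back-pointer to the ONE location that points
     * at it (its parent's child slot, or the root pointer). */
    typedef struct mnode {
        struct mnode **parent_slot;   /* the single pointer that refers to us */
        /* ... node contents ... */
    } MNODE;

    void move_node(MNODE *old, MNODE *new_loc, size_t nbytes) {
        memcpy(new_loc, old, nbytes);     /* copy node into the compacted area */
        *new_loc->parent_slot = new_loc;  /* patch the single inbound pointer */
        /* note: any children's parent_slot fields must likewise be re-aimed
         * at new_loc's child slots, since they point into the old memory */
    }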