Title: CoolCAMs: Power-Efficient TCAMs for Forwarding Engines
1. CoolCAMs: Power-Efficient TCAMs for Forwarding Engines
- Paper by Francis Zane, Girija Narlikar, Anindya Basu (Bell Laboratories, Lucent Technologies)
- Presented by Edward Spitznagel
2. Outline
- Introduction
- TCAMs for Address Lookup
- Bit Selection Architecture
- Trie-based Table Partitioning
- Route Table Updates
- Summary and Discussion
3. Introduction
- Ternary Content-Addressable Memories (TCAMs) are becoming very popular for designing high-throughput forwarding engines; they are:
- fast
- cost-effective
- simple to manage
- Major drawback: high power consumption
- This paper presents architectures and algorithms for making TCAM-based routing tables more power-efficient
4. TCAMs for Address Lookup
- Fully-associative memory, searchable in a single cycle
- Hardware compares the query word (destination address) to all stored words (routing prefixes) in parallel
- each bit of a stored word can be 0, 1, or X (don't care)
- if multiple matches occur, typically the entry with the lowest address is returned
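As a sketch of these matching semantics (my own minimal software model, not code from the paper), each TCAM entry can be represented as a value/mask pair, where a mask bit of 0 marks that position as X:

```python
# Minimal software model of TCAM matching (illustrative sketch only).
# Each stored word is a (value, mask) pair; mask bit 0 means X ("don't care").

def tcam_search(entries, query):
    """Return the address of the first matching entry, or None.

    Hardware compares all entries in parallel; scanning in address order here
    reproduces the usual lowest-address-wins tie-break among multiple matches.
    """
    for addr, (value, mask) in enumerate(entries):
        if (query ^ value) & mask == 0:   # all non-X bits must match exactly
            return addr
    return None

# 8-bit example: prefix 1010**** stored at a lower address than 10******,
# so the longer (more specific) prefix wins when both match.
entries = [(0b10100000, 0b11110000),
           (0b10000000, 0b11000000)]
print(tcam_search(entries, 0b10101111))   # -> 0
print(tcam_search(entries, 0b10011111))   # -> 1
```

Storing longer prefixes at lower addresses is what makes lowest-address-wins equivalent to longest-prefix match.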
5. TCAMs for Address Lookup
- TCAM vendors now provide a mechanism that can reduce power consumption by selectively addressing smaller portions of the TCAM
- The TCAM is divided into a set of blocks; each block is a contiguous, fixed-size chunk of TCAM entries
- e.g. a 512k-entry TCAM could be divided into 64 blocks of 8k entries each
- When a search command is issued, it is possible to specify which block(s) to use in the search
- This saves power, since the main component of TCAM power consumption when searching is proportional to the number of entries searched
6. Bit Selection Architecture
- Based on the observation that most prefixes in core routing tables are between 16 and 24 bits long
- over 98% in the authors' datasets
- Put the very short (<16-bit) and very long (>24-bit) prefixes in a set of TCAM blocks that are searched on every lookup
- The remaining prefixes are partitioned into buckets, one of which is selected by hashing for each lookup
- each bucket is laid out over one or more TCAM blocks
- In this paper, the hashing function is restricted to merely using a selected set of input bits as an index
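Such a bit-selection "hash" can be sketched in a few lines (illustrative; the helper name and bit numbering, with 0 as the most significant bit, are my own assumptions -- in hardware this is just a set of muxes):

```python
def select_bits(addr, bit_positions):
    """Form a bucket index by concatenating chosen bits of a 32-bit address.

    bit_positions: positions within the first 16 bits, 0 = most significant
    bit (my numbering, an assumption; hardware would use simple muxes).
    """
    index = 0
    for pos in bit_positions:
        index = (index << 1) | ((addr >> (31 - pos)) & 1)
    return index

addr = 0b1100_0000_1010_1000_0000_0000_0000_0001   # 192.168.0.1
print(select_bits(addr, [0, 1, 15]))   # -> 6 (bits '1', '1', '0')
```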
7. Bit Selection Architecture
8. Bit Selection Architecture
- A route lookup, then, involves the following:
- the hashing function (bit-selection logic, really) selects k hashing bits from the destination address, which identify the bucket to be searched
- also search the blocks holding the very long and very short prefixes
- The main issues now are:
- how to select the k hashing bits
- we restrict ourselves to choosing hashing bits from the first 16 bits of the address, to avoid replicating prefixes
- how to allocate the different buckets among the various TCAM blocks (since bucket size may not be an integral multiple of the TCAM block size)
9. Bit Selection: Worst-case Power Consumption
- Given any routing table containing N prefixes, each of length ≥ L, what is the size of the largest bucket generated by the best possible hash function that uses k bits out of the first L?
- Theorem III.1: there exists a hash function splitting the set of prefixes such that the size of the largest bucket is bounded (the bound and its proof are in Appendix I of the paper)
- an ideal hash function would generate 2^k equal-sized buckets
- e.g. if k = 3, each bucket holds 1/8 = 0.125 of the prefixes; if k = 6, each holds 1/64 ≈ 0.015625
10. Bit Selection Heuristics
- We don't expect to see the worst-case input, but it gives designers a power budget
- Given such a power budget and a routing table, it suffices to find a set of hashing bits that produces a split not exceeding the power budget (a satisfying split)
- 3 heuristics:
- the first is simple: use the rightmost k of the first 16 bits. In almost all routing tables studied, this works well.
11. Bit Selection Heuristics
- Second heuristic: brute-force search, checking all possible subsets of k bits from the first 16
- guaranteed to find a satisfying split if one exists
- since it examines all C(16, k) possible sets of k bits, running time is maximum at k = 8
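A sketch of the brute-force heuristic (my own code, not the paper's implementation; it assumes each 16-24 bit prefix is represented by its fully-specified first 16 bits, with bit 0 the most significant):

```python
from itertools import combinations
from collections import Counter

def brute_force_split(prefixes16, k):
    """Try all C(16, k) choices of k hashing bits from the first 16 address
    bits; return (best_bits, max_bucket_size) for the choice that minimizes
    the largest bucket. prefixes16: iterable of 16-bit ints."""
    best_bits, best_cmax = None, float("inf")
    for bits in combinations(range(16), k):
        buckets = Counter()
        for p in prefixes16:
            idx = 0
            for pos in bits:
                idx = (idx << 1) | ((p >> (15 - pos)) & 1)
            buckets[idx] += 1
        cmax = max(buckets.values())
        if cmax < best_cmax:
            best_bits, best_cmax = bits, cmax
    return best_bits, best_cmax

# Toy table: 4 prefixes differing only in their last two bits, so the two
# rightmost positions give a perfectly even split.
print(brute_force_split([0, 1, 2, 3], 2))   # -> ((14, 15), 1)
```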
12. Bit Selection Heuristics
- Third heuristic: a greedy algorithm
- falls between the simple heuristic and the brute-force one in terms of complexity and accuracy
- to select k hashing bits, the algorithm performs k iterations, selecting one bit per iteration
- the number of buckets doubles each iteration
- the goal in each iteration is to select the bit that minimizes the size of the biggest bucket produced in that iteration
13. Bit Selection Heuristics
- Third heuristic: greedy algorithm pseudocode
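The pseudocode figure is not reproduced in this transcript; as a stand-in, a Python sketch of the greedy heuristic as described above (names and representation are my own) might look like:

```python
def greedy_bits(prefixes16, k):
    """Greedy heuristic (sketch): pick one hashing bit per iteration, each
    time choosing the bit that minimizes the largest bucket after splitting
    every current bucket in two. prefixes16: 16-bit ints (bit 0 = MSB)."""
    chosen, buckets = [], [list(prefixes16)]
    for _ in range(k):
        best_bit, best_split, best_cmax = None, None, float("inf")
        for bit in range(16):
            if bit in chosen:
                continue
            split = []
            for bkt in buckets:           # every bucket splits in two on this bit
                split.append([p for p in bkt if not (p >> (15 - bit)) & 1])
                split.append([p for p in bkt if (p >> (15 - bit)) & 1])
            cmax = max(len(s) for s in split)
            if cmax < best_cmax:
                best_bit, best_split, best_cmax = bit, split, cmax
        chosen.append(best_bit)
        buckets = best_split              # bucket count doubles each iteration
    return chosen, max(len(bkt) for bkt in buckets)

print(greedy_bits([0, 1, 2, 3], 2))   # -> ([14, 15], 1)
```

On this toy input the greedy choice coincides with the brute-force optimum; in general it may not, which is the complexity/accuracy trade-off the slide describes.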
14. Bit Selection Heuristics
- Combining the heuristics to reduce running time (in typical cases):
- first, try the simple heuristic (use the k rightmost bits), and stop if that succeeds
- otherwise, apply the third heuristic (greedy algorithm), and stop if that succeeds
- otherwise, apply the brute-force heuristic
- the algorithm is re-applied whenever route updates cause any bucket to become too large
15. Bit Selection Architecture: Experimental Results
- Evaluate the heuristics with respect to two metrics: running time and quality of the splits produced
- applied to real core routing tables; results are presented for two, but others were similar
- also applied to a synthetic table with 1M entries, constructed by randomly picking how many prefixes share each combination of the first 16 bits
16. Bit Selection Results: Running Time
- Running time measured on an 800 MHz PC
- required less than 1 MB of memory
17. Bit Selection Results: Quality of Splits
- let N denote the number of 16-24 bit prefixes
- let cmax denote the maximum bucket size
- the ratio N / cmax measures the quality (evenness) of the split produced by the hashing bits
- it is the factor of reduction in the portion of the TCAM that needs to be searched
18. Bit Selection Architecture: Laying Out TCAM Buckets
- Blocks for very long prefixes and very short prefixes are placed in the TCAM at the beginning and end, respectively
- this ensures that we select the longest prefix if more than one matches
- Buckets are laid out sequentially, in any order
- any bucket of size c occupies no more than ⌈c/s⌉ + 1 TCAM blocks, where s is the TCAM block size
- at most ⌈cmax/s⌉ + 1 TCAM blocks need to be searched for any lookup (plus the blocks for the very long and very short prefixes)
- thus the actual power-savings ratio is not quite as good as the N / cmax mentioned before, but it is still good
19. Bit Selection Architecture: Remarks
- Good average-case power reduction, but the worst-case bounds are not as good
- hardware designers are thus forced to design for much higher power consumption than will be seen in practice
- Assumes most prefixes are 16-24 bits long
- this may not always hold (e.g. the number of long (>24-bit) prefixes may increase in the future)
20. Trie-based Table Partitioning
- A partitioning scheme using a routing trie data structure
- Eliminates the two drawbacks of the bit-selection architecture:
- worst-case bounds on power consumption that do not match power consumption in practice
- the assumption that most prefixes are 16-24 bits long
- Two trie-based schemes (subtree-split and postorder-split), both involving two steps:
- construct a binary routing trie from the routing table
- partitioning step: carve subtrees out of the trie and place them into buckets
- The two schemes differ in their partitioning step
21. Trie-based Architecture
- Trie-based forwarding engine architecture:
- use an index TCAM (instead of hashing) to determine which bucket to search
- requires searching the entire index TCAM, but typically the index TCAM is very small
22. Overview of Routing Tries
- A 1-bit trie can be used for performing longest-prefix matches
- the trie consists of nodes, where a routing prefix of length n is stored at level n of the trie
- Routing lookup process:
- start at the root
- scan the input, descending left (right) if the next bit of the input is 0 (1), until a leaf node is reached
- the last prefix encountered is the longest matching prefix
- count(v): the number of routing prefixes in the subtree rooted at v
- the covering prefix of a node u is the prefix of the nearest ancestor of u (including u itself) that is in the routing table
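These definitions can be made concrete with a minimal 1-bit trie in Python (my own sketch; the class and field names are assumptions, not the paper's):

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix        # bit string on the path from the root
        self.in_table = False       # True if this prefix is in the routing table
        self.count = 0              # count(v): table prefixes in this subtree
        self.child = {}             # '0'/'1' -> TrieNode

def build_trie(prefixes):
    """Insert each prefix (a bit string); a length-n prefix ends at level n."""
    root = TrieNode()
    for p in prefixes:
        node = root
        node.count += 1
        for bit in p:
            if bit not in node.child:
                node.child[bit] = TrieNode(node.prefix + bit)
            node = node.child[bit]
            node.count += 1
        node.in_table = True
    return root

def lookup(root, addr_bits):
    """Longest-prefix match: descend on the input bits, remembering the
    prefix of the last node seen that is in the table."""
    node, best = root, None
    for bit in addr_bits:
        node = node.child.get(bit)
        if node is None:
            break
        if node.in_table:
            best = node.prefix
    return best

root = build_trie(["0", "01", "011", "1"])
print(lookup(root, "0110"))   # -> "011"
print(root.count)             # -> 4
```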
23. Routing Trie Example
Routing Table
Corresponding 1-bit trie
24. Splitting into Subtrees
- Subtree-split algorithm
- input: b, the maximum size of a TCAM bucket
- output: a set of K ≤ ⌈2N/b⌉ TCAM buckets, each with size in the range [⌈b/2⌉, b], and an index TCAM of size K
- Partitioning step: post-order traversal of the trie, looking for carving nodes
- carving node: a node with count in [⌈b/2⌉, b] whose parent's count is > b
- When we find a carving node v:
- carve out the subtree rooted at v and place it in a separate bucket
- place the prefix of v in the index TCAM, and add the covering prefix of v to the bucket
- decrease the counts of all ancestors of v by count(v)
25. Subtree-split Algorithm
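The algorithm figure is not reproduced in this transcript; the following self-contained Python sketch implements a simplified variant of subtree-split on a minimal trie (my own class and helper names; covering-prefix bookkeeping is omitted). It carves a subtree as soon as its remaining count reaches ⌈b/2⌉, which yields the bucket-size range stated in the remarks:

```python
class Node:
    def __init__(self, prefix=""):
        self.prefix, self.in_table, self.child = prefix, False, {}

def build(prefixes):
    root = Node()
    for p in prefixes:
        n = root
        for bit in p:
            if bit not in n.child:
                n.child[bit] = Node(n.prefix + bit)
            n = n.child[bit]
        n.in_table = True
    return root

def carve(node):
    """Detach node's remaining subtree, returning its table prefixes."""
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.in_table:
            out.append(n.prefix)
            n.in_table = False
        stack.extend(n.child.values())
        n.child = {}
    return out

def subtree_split(root, b):
    """Post-order traversal; carve a subtree once its (current) count reaches
    ceil(b/2), so every carved bucket holds between ceil(b/2) and b prefixes."""
    buckets, index = [], []

    def visit(node):
        c = node.in_table + sum(visit(ch) for ch in list(node.child.values()))
        if c >= (b + 1) // 2:                 # carving node
            buckets.append(carve(node))
            index.append(node.prefix or "*")  # index-TCAM entry for this subtree
            return 0
        return c

    if visit(root):                           # leftover (< ceil(b/2)) -> last bucket
        buckets.append(carve(root))
        index.append("*")
    return buckets, index

table = ["0", "00", "01", "000", "001", "1", "10", "11"]
buckets, index = subtree_split(build(table), 4)
print([len(x) for x in buckets])   # bucket sizes: [3, 2, 3], all in [2, 4]
```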
26-29. Subtree-split Example (b = 4)
30. Subtree-split Remarks
- Subtree-split creates buckets whose sizes range from ⌈b/2⌉ to b (except the last, which ranges from 1 to b)
- At most one covering prefix is added to each bucket
- The total number of buckets created ranges from ⌈N/b⌉ to ⌈2N/b⌉; each bucket results in one entry in the index TCAM
- Using subtree-split in a TCAM with K buckets, at most K + ⌈2N/K⌉ prefixes are searched from the index and data TCAMs during any lookup
- Total complexity of the subtree-split algorithm is O(N + NW/b)
31. Post-order Splitting
- Partitions the table into buckets of exactly b prefixes
- an improvement over subtree-split, where the smallest and largest bucket sizes can vary by a factor of 2
- this comes at the cost of more entries in the index TCAM
- Partitioning step: post-order traversal of the trie, looking for subtrees to carve out, but:
- buckets are made from collections of subtrees, rather than just a single subtree
- this is because the trie may not contain ⌈N/b⌉ subtrees of exactly b prefixes each
32. Post-order Splitting
- postorder-split does a post-order traversal of the trie, calling carve-exact to carve out subtree collections of size b
- carve-exact does the actual carving:
- at a node with count = b, it simply carves out that subtree
- at a node with count < b whose parent has count ≤ b, it does nothing (we will later have a chance to carve the parent)
- at a node with count x < b whose parent has count > b, it:
- carves out the subtree of size x at this node, and
- recursively calls carve-exact again, this time looking for a carving of size b - x (instead of b)
33. Post-order Split Algorithm
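The algorithm figure is likewise not reproduced; a self-contained sketch of the carve-exact idea (my own simplified code on a minimal trie; covering-prefix bookkeeping is omitted) could look like this, carving whole subtrees where they fit and recursing for the remainder until each bucket holds exactly b prefixes:

```python
class Node:
    def __init__(self, prefix=""):
        self.prefix, self.in_table, self.child = prefix, False, {}

def build(prefixes):
    root = Node()
    for p in prefixes:
        n = root
        for bit in p:
            if bit not in n.child:
                n.child[bit] = Node(n.prefix + bit)
            n = n.child[bit]
        n.in_table = True
    return root

def count(node):
    return node.in_table + sum(count(c) for c in node.child.values())

def take(node, x, roots):
    """carve-exact sketch: remove up to x prefixes from node's subtree in
    post-order, carving whole subtrees where they fit; each carved root is
    recorded in roots (it becomes an index-TCAM entry for the bucket)."""
    if count(node) <= x:                    # whole subtree fits in the bucket
        roots.append(node.prefix or "*")
        out, stack = [], [node]
        while stack:
            n = stack.pop()
            if n.in_table:
                out.append(n.prefix)
                n.in_table = False
            stack.extend(n.child.values())
            n.child = {}
        return out
    out = []
    for c in list(node.child.values()):     # children first (post-order)
        if len(out) == x:
            break
        out += take(c, x - len(out), roots) # look for a carving of size x - |out|
    if len(out) < x and node.in_table:      # finally the node's own prefix
        node.in_table = False
        roots.append(node.prefix)
        out.append(node.prefix)
    return out

def postorder_split(root, b):
    """Carve subtree collections totalling exactly b prefixes per bucket
    (the last bucket may be smaller)."""
    buckets, index = [], []
    while count(root):
        roots = []
        buckets.append(take(root, b, roots))
        index.append(roots)
    return buckets, index

table = ["0", "00", "01", "000", "001", "1", "10", "11"]
buckets, index = postorder_split(build(table), 4)
print([len(x) for x in buckets])   # -> [4, 4]
```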
34-36. Post-order Split Example (b = 4)
37. Postorder-split Remarks
- Postorder-split creates buckets of size exactly b (except the last, which ranges from 1 to b)
- At most W covering prefixes are added to each bucket, where W is the length of the longest prefix in the table
- The total number of buckets created is exactly ⌈N/b⌉; each bucket results in at most W + 1 entries in the index TCAM
- Using postorder-split in a TCAM with K buckets, at most (W + 1)K + ⌈N/K⌉ + W prefixes are searched from the index and data TCAMs during any lookup
- Total complexity of the postorder-split algorithm is O(N + NW/b)
38-39. Post-order Split: Experimental Results
- Reduction in routing table entries searched
40. Route Table Updates
- Briefly explore performance in the face of routing table updates
- Adding routes may cause a TCAM bucket to overflow, requiring repartitioning of the prefixes and rewriting the entire table into the TCAM
- Apply real-life update traces (about 3.5M updates each) to the bit-selection and trie-based schemes to see how often recomputation is needed
41. Route Table Updates
- Bit-selection architecture:
- apply the brute-force heuristic on the initial table; note the size cmax of the largest bucket
- recompute the hashing bits when any bucket grows beyond cthresh = (1 + t) × cmax, for some threshold t
- when recomputing, first try the static heuristic; if needed, then try the greedy algorithm; fall back on brute force if necessary
- Trie-based architecture:
- similar threshold-based strategy
- subtree-split: use a bucket size of ⌈2N/K⌉
- postorder-split: use a bucket size of ⌈N/K⌉
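The threshold test itself is a one-liner; a sketch (my own function name) just to make the recomputation trigger concrete:

```python
def needs_recompute(bucket_sizes, cmax, t):
    """Threshold test for the update strategy (sketch): recompute the hashing
    bits only when some bucket exceeds cthresh = (1 + t) * cmax, where cmax
    is the largest bucket size recorded at the last recomputation."""
    return max(bucket_sizes) > (1 + t) * cmax

print(needs_recompute([90, 110], 100, 0.2))   # 110 <= 120 -> False
print(needs_recompute([90, 130], 100, 0.2))   # 130 >  120 -> True
```

Larger t means fewer expensive recomputations but a looser bound on per-lookup power.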
42. Route Table Updates
- Results for the bit-selection architecture
43. Route Table Updates
- Results for the trie-based architecture
44. Route Table Updates: The post-opt Algorithm
- post-opt: the post-order split algorithm with clever handling of updates
- with post-order split, we can transfer prefixes between neighboring buckets easily (only a few writes to the index and data TCAMs are needed)
- so, if a bucket becomes overfull, we can usually just transfer one of its prefixes to a neighboring bucket
- repartitioning, then, is only needed when both neighboring buckets are also full
45. Summary
- TCAMs would be great for routing lookup, if they didn't use so much power
- CoolCAMs: two architectures that use partitioned TCAMs to reduce power consumption during routing lookup
- bit-selection architecture
- trie-based table partitioning (subtree-split and postorder-split)
- each scheme has its own subtle advantages and disadvantages, but overall they seem to work well
46. Discussion