Title: CoolCAMs: Power-Efficient TCAMs for Forwarding Engines
1. CoolCAMs: Power-Efficient TCAMs for Forwarding Engines
- Paper by Francis Zane, Girija Narlikar, Anindya Basu (Bell Laboratories, Lucent Technologies)
- Presented by Edward Spitznagel
2. Outline
- Introduction
- TCAMs for Address Lookup
- Bit Selection Architecture
- Trie-based Table Partitioning
- Route Table Updates
- Summary and Discussion
3. Introduction
- Ternary Content-Addressable Memories (TCAMs) are becoming very popular for designing high-throughput forwarding engines; they are:
- fast
- cost-effective
- simple to manage
- Major drawback: high power consumption
- This paper presents architectures and algorithms for making TCAM-based routing tables more power-efficient
4. TCAMs for Address Lookup
- Fully-associative memory, searchable in a single cycle
- Hardware compares the query word (destination address) to all stored words (routing prefixes) in parallel
- each bit of a stored word can be 0, 1, or X (don't care)
- if multiple matches occur, typically the entry with the lowest address is returned
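As a sketch of these matching semantics (my own minimal software model, not code from the paper), each TCAM entry can be represented as a value/mask pair, where a mask bit of 0 marks that position as X:

```python
# Minimal software model of TCAM matching (illustrative sketch only).
# Each stored word is a (value, mask) pair; mask bit 0 means X ("don't care").

def tcam_search(entries, query):
    """Return the address of the first matching entry, or None.

    Hardware compares all entries in parallel; scanning in address order here
    reproduces the usual lowest-address-wins tie-break among multiple matches.
    """
    for addr, (value, mask) in enumerate(entries):
        if (query ^ value) & mask == 0:   # all non-X bits must match exactly
            return addr
    return None

# 8-bit example: prefix 1010**** stored at a lower address than 10******,
# so the longer (more specific) prefix wins when both match.
entries = [(0b10100000, 0b11110000),
           (0b10000000, 0b11000000)]
print(tcam_search(entries, 0b10101111))   # -> 0
print(tcam_search(entries, 0b10011111))   # -> 1
```

Storing longer prefixes at lower addresses is what makes lowest-address-wins equivalent to longest-prefix match.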
5. TCAMs for Address Lookup
- TCAM vendors now provide a mechanism that can reduce power consumption by selectively addressing smaller portions of the TCAM
- The TCAM is divided into a set of blocks; each block is a contiguous, fixed-size chunk of TCAM entries
- e.g. a 512k-entry TCAM could be divided into 64 blocks of 8k entries each
- When a search command is issued, it is possible to specify which block(s) to use in the search
- This saves power, since the main component of TCAM power consumption when searching is proportional to the number of entries searched
6. Bit Selection Architecture
- Based on the observation that most prefixes in core routing tables are between 16 and 24 bits long
- over 98% in the authors' datasets
- Put the very short (<16-bit) and very long (>24-bit) prefixes in a set of TCAM blocks that are searched on every lookup
- The remaining prefixes are partitioned into buckets, one of which is selected by hashing for each lookup
- each bucket is laid out over one or more TCAM blocks
- In this paper, the hashing function is restricted to merely using a selected set of input bits as an index
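Such a bit-selection "hash" can be sketched in a few lines (illustrative; the helper name and bit numbering, with 0 as the most significant bit, are my own assumptions -- in hardware this is just a set of muxes):

```python
def select_bits(addr, bit_positions):
    """Form a bucket index by concatenating chosen bits of a 32-bit address.

    bit_positions: positions within the first 16 bits, 0 = most significant
    bit (my numbering, an assumption; hardware would use simple muxes).
    """
    index = 0
    for pos in bit_positions:
        index = (index << 1) | ((addr >> (31 - pos)) & 1)
    return index

addr = 0b1100_0000_1010_1000_0000_0000_0000_0001   # 192.168.0.1
print(select_bits(addr, [0, 1, 15]))   # -> 6 (bits '1', '1', '0')
```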
7. Bit Selection Architecture
8. Bit Selection Architecture
- A route lookup, then, involves the following:
- the hashing function (bit-selection logic, really) selects k hashing bits from the destination address, which identify the bucket to be searched
- also search the blocks holding the very long and very short prefixes
- The main issues now are:
- how to select the k hashing bits
- we restrict ourselves to choosing hashing bits from the first 16 bits of the address, to avoid replicating prefixes
- how to allocate the different buckets among the various TCAM blocks (since bucket size may not be an integral multiple of the TCAM block size)
9. Bit Selection: Worst-case Power Consumption
- Given any routing table containing N prefixes, each of length ≥ L, what is the size of the largest bucket generated by the best possible hash function that uses k bits out of the first L?
- Theorem III.1: there exists a hash function splitting the set of prefixes such that the size of the largest bucket is bounded (the bound and its proof are in Appendix I of the paper)
- an ideal hash function would generate 2^k equal-sized buckets
- e.g. if k = 3, each bucket holds 1/8 = 0.125 of the prefixes; if k = 6, each holds 1/64 ≈ 0.015625
10. Bit Selection Heuristics
- We don't expect to see the worst-case input, but it gives designers a power budget
- Given such a power budget and a routing table, it suffices to find a set of hashing bits that produces a split not exceeding the power budget (a satisfying split)
- 3 heuristics:
- the first is simple: use the rightmost k of the first 16 bits. In almost all routing tables studied, this works well.
11. Bit Selection Heuristics
- Second heuristic: brute-force search, checking all possible subsets of k bits from the first 16
- guaranteed to find a satisfying split if one exists
- since it examines all C(16, k) possible sets of k bits, running time is maximum at k = 8
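A sketch of the brute-force heuristic (my own code, not the paper's implementation; it assumes each 16-24 bit prefix is represented by its fully-specified first 16 bits, with bit 0 the most significant):

```python
from itertools import combinations
from collections import Counter

def brute_force_split(prefixes16, k):
    """Try all C(16, k) choices of k hashing bits from the first 16 address
    bits; return (best_bits, max_bucket_size) for the choice that minimizes
    the largest bucket. prefixes16: iterable of 16-bit ints."""
    best_bits, best_cmax = None, float("inf")
    for bits in combinations(range(16), k):
        buckets = Counter()
        for p in prefixes16:
            idx = 0
            for pos in bits:
                idx = (idx << 1) | ((p >> (15 - pos)) & 1)
            buckets[idx] += 1
        cmax = max(buckets.values())
        if cmax < best_cmax:
            best_bits, best_cmax = bits, cmax
    return best_bits, best_cmax

# Toy table: 4 prefixes differing only in their last two bits, so the two
# rightmost positions give a perfectly even split.
print(brute_force_split([0, 1, 2, 3], 2))   # -> ((14, 15), 1)
```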
12. Bit Selection Heuristics
- Third heuristic: a greedy algorithm
- falls between the simple heuristic and the brute-force one in terms of complexity and accuracy
- to select k hashing bits, the algorithm performs k iterations, selecting one bit per iteration
- the number of buckets doubles each iteration
- the goal in each iteration is to select the bit that minimizes the size of the biggest bucket produced in that iteration
13. Bit Selection Heuristics
- Third heuristic: greedy algorithm pseudocode
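The pseudocode figure is not reproduced in this transcript; as a stand-in, a Python sketch of the greedy heuristic as described above (names and representation are my own) might look like:

```python
def greedy_bits(prefixes16, k):
    """Greedy heuristic (sketch): pick one hashing bit per iteration, each
    time choosing the bit that minimizes the largest bucket after splitting
    every current bucket in two. prefixes16: 16-bit ints (bit 0 = MSB)."""
    chosen, buckets = [], [list(prefixes16)]
    for _ in range(k):
        best_bit, best_split, best_cmax = None, None, float("inf")
        for bit in range(16):
            if bit in chosen:
                continue
            split = []
            for bkt in buckets:           # every bucket splits in two on this bit
                split.append([p for p in bkt if not (p >> (15 - bit)) & 1])
                split.append([p for p in bkt if (p >> (15 - bit)) & 1])
            cmax = max(len(s) for s in split)
            if cmax < best_cmax:
                best_bit, best_split, best_cmax = bit, split, cmax
        chosen.append(best_bit)
        buckets = best_split              # bucket count doubles each iteration
    return chosen, max(len(bkt) for bkt in buckets)

print(greedy_bits([0, 1, 2, 3], 2))   # -> ([14, 15], 1)
```

On this toy input the greedy choice coincides with the brute-force optimum; in general it may not, which is the complexity/accuracy trade-off the slide describes.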
14. Bit Selection Heuristics
- Combining the heuristics to reduce running time (in typical cases):
- first, try the simple heuristic (use the k rightmost bits), and stop if that succeeds
- otherwise, apply the third heuristic (greedy algorithm), and stop if that succeeds
- otherwise, apply the brute-force heuristic
- the algorithm is re-applied whenever route updates cause any bucket to become too large
15. Bit Selection Architecture: Experimental Results
- Evaluate the heuristics with respect to two metrics: running time and quality of the splits produced
- applied to real core routing tables; results are presented for two, but others were similar
- also applied to a synthetic table with 1M entries, constructed by randomly picking how many prefixes share each combination of the first 16 bits
16. Bit Selection Results: Running Time
- Running time measured on an 800 MHz PC
- required less than 1 MB of memory
17. Bit Selection Results: Quality of Splits
- let N denote the number of 16-24 bit prefixes
- let cmax denote the maximum bucket size
- the ratio N / cmax measures the quality (evenness) of the split produced by the hashing bits
- it is the factor of reduction in the portion of the TCAM that needs to be searched
18. Bit Selection Architecture: Laying Out TCAM Buckets
- Blocks for very long prefixes and very short prefixes are placed in the TCAM at the beginning and end, respectively
- this ensures that we select the longest prefix if more than one matches
- Buckets are laid out sequentially, in any order
- any bucket of size c occupies no more than ⌈c/s⌉ + 1 TCAM blocks, where s is the TCAM block size
- at most ⌈cmax/s⌉ + 1 TCAM blocks need to be searched for any lookup (plus the blocks for the very long and very short prefixes)
- thus the actual power-savings ratio is not quite as good as the N / cmax mentioned before, but it is still good
19. Bit Selection Architecture: Remarks
- Good average-case power reduction, but the worst-case bounds are not as good
- hardware designers are thus forced to design for much higher power consumption than will be seen in practice
- Assumes most prefixes are 16-24 bits long
- this may not always hold (e.g. the number of long (>24-bit) prefixes may increase in the future)
20. Trie-based Table Partitioning
- A partitioning scheme using a routing trie data structure
- Eliminates the two drawbacks of the bit-selection architecture:
- worst-case bounds on power consumption that do not match power consumption in practice
- the assumption that most prefixes are 16-24 bits long
- Two trie-based schemes (subtree-split and postorder-split), both involving two steps:
- construct a binary routing trie from the routing table
- partitioning step: carve subtrees out of the trie and place them into buckets
- The two schemes differ in their partitioning step
21. Trie-based Architecture
- Trie-based forwarding engine architecture:
- use an index TCAM (instead of hashing) to determine which bucket to search
- requires searching the entire index TCAM, but typically the index TCAM is very small
22. Overview of Routing Tries
- A 1-bit trie can be used for performing longest-prefix matches
- the trie consists of nodes, where a routing prefix of length n is stored at level n of the trie
- Routing lookup process:
- start at the root
- scan the input, descending left (right) if the next bit of the input is 0 (1), until a leaf node is reached
- the last prefix encountered is the longest matching prefix
- count(v): the number of routing prefixes in the subtree rooted at v
- the covering prefix of a node u is the prefix of the nearest ancestor of u (including u itself) that is in the routing table
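These definitions can be made concrete with a minimal 1-bit trie in Python (my own sketch; the class and field names are assumptions, not the paper's):

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix        # bit string on the path from the root
        self.in_table = False       # True if this prefix is in the routing table
        self.count = 0              # count(v): table prefixes in this subtree
        self.child = {}             # '0'/'1' -> TrieNode

def build_trie(prefixes):
    """Insert each prefix (a bit string); a length-n prefix ends at level n."""
    root = TrieNode()
    for p in prefixes:
        node = root
        node.count += 1
        for bit in p:
            if bit not in node.child:
                node.child[bit] = TrieNode(node.prefix + bit)
            node = node.child[bit]
            node.count += 1
        node.in_table = True
    return root

def lookup(root, addr_bits):
    """Longest-prefix match: descend on the input bits, remembering the
    prefix of the last node seen that is in the table."""
    node, best = root, None
    for bit in addr_bits:
        node = node.child.get(bit)
        if node is None:
            break
        if node.in_table:
            best = node.prefix
    return best

root = build_trie(["0", "01", "011", "1"])
print(lookup(root, "0110"))   # -> "011"
print(root.count)             # -> 4
```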
23. Routing Trie Example
Routing Table
Corresponding 1-bit trie
24. Splitting into Subtrees
- Subtree-split algorithm
- input: b, the maximum size of a TCAM bucket
- output: a set of K ≤ ⌈2N/b⌉ TCAM buckets, each with size in the range [⌈b/2⌉, b], and an index TCAM of size K
- Partitioning step: post-order traversal of the trie, looking for carving nodes
- carving node: a node with count in [⌈b/2⌉, b] whose parent's count is > b
- When we find a carving node v:
- carve out the subtree rooted at v and place it in a separate bucket
- place the prefix of v in the index TCAM, and add the covering prefix of v to the bucket
- decrease the counts of all ancestors of v by count(v)
25. Subtree-split Algorithm
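The algorithm figure is not reproduced in this transcript; the following self-contained Python sketch implements a simplified variant of subtree-split on a minimal trie (my own class and helper names; covering-prefix bookkeeping is omitted). It carves a subtree as soon as its remaining count reaches ⌈b/2⌉, which yields the bucket-size range stated in the remarks:

```python
class Node:
    def __init__(self, prefix=""):
        self.prefix, self.in_table, self.child = prefix, False, {}

def build(prefixes):
    root = Node()
    for p in prefixes:
        n = root
        for bit in p:
            if bit not in n.child:
                n.child[bit] = Node(n.prefix + bit)
            n = n.child[bit]
        n.in_table = True
    return root

def carve(node):
    """Detach node's remaining subtree, returning its table prefixes."""
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.in_table:
            out.append(n.prefix)
            n.in_table = False
        stack.extend(n.child.values())
        n.child = {}
    return out

def subtree_split(root, b):
    """Post-order traversal; carve a subtree once its (current) count reaches
    ceil(b/2), so every carved bucket holds between ceil(b/2) and b prefixes."""
    buckets, index = [], []

    def visit(node):
        c = node.in_table + sum(visit(ch) for ch in list(node.child.values()))
        if c >= (b + 1) // 2:                 # carving node
            buckets.append(carve(node))
            index.append(node.prefix or "*")  # index-TCAM entry for this subtree
            return 0
        return c

    if visit(root):                           # leftover (< ceil(b/2)) -> last bucket
        buckets.append(carve(root))
        index.append("*")
    return buckets, index

table = ["0", "00", "01", "000", "001", "1", "10", "11"]
buckets, index = subtree_split(build(table), 4)
print([len(x) for x in buckets])   # bucket sizes: [3, 2, 3], all in [2, 4]
```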
26-29. Subtree-split Example (b = 4)
30. Subtree-split Remarks
- Subtree-split creates buckets whose sizes range from ⌈b/2⌉ to b (except the last, which ranges from 1 to b)
- At most one covering prefix is added to each bucket
- The total number of buckets created ranges from ⌈N/b⌉ to ⌈2N/b⌉; each bucket results in one entry in the index TCAM
- Using subtree-split in a TCAM with K buckets, at most K + ⌈2N/K⌉ prefixes are searched from the index and data TCAMs during any lookup
- Total complexity of the subtree-split algorithm is O(N + NW/b)
31. Post-order Splitting
- Partitions the table into buckets of exactly b prefixes
- an improvement over subtree-split, where the smallest and largest bucket sizes can vary by a factor of 2
- this comes at the cost of more entries in the index TCAM
- Partitioning step: post-order traversal of the trie, looking for subtrees to carve out, but:
- buckets are made from collections of subtrees, rather than just a single subtree
- this is because the trie may not contain ⌈N/b⌉ subtrees of exactly b prefixes each
32. Post-order Splitting
- postorder-split does a post-order traversal of the trie, calling carve-exact to carve out subtree collections of size b
- carve-exact does the actual carving:
- at a node with count = b, it simply carves out that subtree
- at a node with count < b whose parent has count ≤ b, it does nothing (we will later have a chance to carve the parent)
- at a node with count x < b whose parent has count > b, it:
- carves out the subtree of size x at this node, and
- recursively calls carve-exact again, this time looking for a carving of size b - x (instead of b)
33. Post-order Split Algorithm
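The algorithm figure is likewise not reproduced; a self-contained sketch of the carve-exact idea (my own simplified code on a minimal trie; covering-prefix bookkeeping is omitted) could look like this, carving whole subtrees where they fit and recursing for the remainder until each bucket holds exactly b prefixes:

```python
class Node:
    def __init__(self, prefix=""):
        self.prefix, self.in_table, self.child = prefix, False, {}

def build(prefixes):
    root = Node()
    for p in prefixes:
        n = root
        for bit in p:
            if bit not in n.child:
                n.child[bit] = Node(n.prefix + bit)
            n = n.child[bit]
        n.in_table = True
    return root

def count(node):
    return node.in_table + sum(count(c) for c in node.child.values())

def take(node, x, roots):
    """carve-exact sketch: remove up to x prefixes from node's subtree in
    post-order, carving whole subtrees where they fit; each carved root is
    recorded in roots (it becomes an index-TCAM entry for the bucket)."""
    if count(node) <= x:                    # whole subtree fits in the bucket
        roots.append(node.prefix or "*")
        out, stack = [], [node]
        while stack:
            n = stack.pop()
            if n.in_table:
                out.append(n.prefix)
                n.in_table = False
            stack.extend(n.child.values())
            n.child = {}
        return out
    out = []
    for c in list(node.child.values()):     # children first (post-order)
        if len(out) == x:
            break
        out += take(c, x - len(out), roots) # look for a carving of size x - |out|
    if len(out) < x and node.in_table:      # finally the node's own prefix
        node.in_table = False
        roots.append(node.prefix)
        out.append(node.prefix)
    return out

def postorder_split(root, b):
    """Carve subtree collections totalling exactly b prefixes per bucket
    (the last bucket may be smaller)."""
    buckets, index = [], []
    while count(root):
        roots = []
        buckets.append(take(root, b, roots))
        index.append(roots)
    return buckets, index

table = ["0", "00", "01", "000", "001", "1", "10", "11"]
buckets, index = postorder_split(build(table), 4)
print([len(x) for x in buckets])   # -> [4, 4]
```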
34-36. Post-order Split Example (b = 4)
37. Postorder-split Remarks
- Postorder-split creates buckets of size exactly b (except the last, which ranges from 1 to b)
- At most W covering prefixes are added to each bucket, where W is the length of the longest prefix in the table
- The total number of buckets created is exactly ⌈N/b⌉; each bucket results in at most W + 1 entries in the index TCAM
- Using postorder-split in a TCAM with K buckets, at most (W + 1)K + ⌈N/K⌉ + W prefixes are searched from the index and data TCAMs during any lookup
- Total complexity of the postorder-split algorithm is O(N + NW/b)
38-39. Post-order Split: Experimental Results
- Reduction in routing table entries searched
40. Route Table Updates
- Briefly explore performance in the face of routing table updates
- Adding routes may cause a TCAM bucket to overflow, requiring repartitioning of the prefixes and rewriting the entire table into the TCAM
- Apply real-life update traces (about 3.5M updates each) to the bit-selection and trie-based schemes to see how often recomputation is needed
41. Route Table Updates
- Bit-selection architecture:
- apply the brute-force heuristic on the initial table; note the size cmax of the largest bucket
- recompute the hashing bits when any bucket grows beyond cthresh = (1 + t) × cmax, for some threshold t
- when recomputing, first try the static heuristic; if needed, then try the greedy algorithm; fall back on brute force if necessary
- Trie-based architecture:
- similar threshold-based strategy
- subtree-split: use a bucket size of ⌈2N/K⌉
- postorder-split: use a bucket size of ⌈N/K⌉
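The threshold test itself is a one-liner; a sketch (my own function name) just to make the recomputation trigger concrete:

```python
def needs_recompute(bucket_sizes, cmax, t):
    """Threshold test for the update strategy (sketch): recompute the hashing
    bits only when some bucket exceeds cthresh = (1 + t) * cmax, where cmax
    is the largest bucket size recorded at the last recomputation."""
    return max(bucket_sizes) > (1 + t) * cmax

print(needs_recompute([90, 110], 100, 0.2))   # 110 <= 120 -> False
print(needs_recompute([90, 130], 100, 0.2))   # 130 >  120 -> True
```

Larger t means fewer expensive recomputations but a looser bound on per-lookup power.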
42. Route Table Updates
- Results for the bit-selection architecture
43. Route Table Updates
- Results for the trie-based architecture
44. Route Table Updates: The post-opt Algorithm
- post-opt: the post-order split algorithm with clever handling of updates
- with post-order split, we can transfer prefixes between neighboring buckets easily (only a few writes to the index and data TCAMs are needed)
- so, if a bucket becomes overfull, we can usually just transfer one of its prefixes to a neighboring bucket
- repartitioning, then, is only needed when both neighboring buckets are also full
45. Summary
- TCAMs would be great for routing lookup, if they didn't use so much power
- CoolCAMs: two architectures that use partitioned TCAMs to reduce power consumption during routing lookup
- bit-selection architecture
- trie-based table partitioning (subtree-split and postorder-split)
- each scheme has its own subtle advantages and disadvantages, but overall they seem to work well
46. Discussion