1
CoolCAMs: Power-Efficient TCAMs for Forwarding
Engines
  • Paper by
  • Francis Zane, Girija Narlikar, Anindya Basu
  • Bell Laboratories, Lucent Technologies
  • Presented by Edward Spitznagel

2
Outline
  • Introduction
  • TCAMs for Address Lookup
  • Bit Selection Architecture
  • Trie-based Table Partitioning
  • Route Table Updates
  • Summary and Discussion

3
Introduction
  • Ternary Content-Addressable Memories (TCAMs) are
    becoming very popular for designing
    high-throughput forwarding engines, since they are
  • fast
  • cost-effective
  • simple to manage
  • Major drawback: high power consumption
  • This paper presents architectures and algorithms
    for making TCAM-based routing tables more
    power-efficient

4
TCAMs for Address Lookup
  • Fully-associative memory, searchable in a single
    cycle
  • Hardware compares query word (destination
    address) to all stored words (routing prefixes)
    in parallel
  • each bit of a stored word can be 0, 1, or X
    (don't care)
  • in the event that multiple matches occur,
    typically the entry with the lowest address is
    returned

5
TCAMs for Address Lookup
  • TCAM vendors now provide a mechanism that can
    reduce power consumption by selectively
    addressing smaller portions of the TCAM
  • The TCAM is divided into a set of blocks; each
    block is a contiguous, fixed-size chunk of TCAM
    entries
  • e.g. a 512k-entry TCAM could be divided into 64
    blocks of 8k entries each
  • When a search command is issued, it is possible
    to specify which block(s) to use in the search
  • This can help us save power, since the main
    component of TCAM power consumption when
    searching is proportional to the number of
    searched entries (see the sketch below)
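  • To make the savings concrete, a back-of-the-envelope sketch in
    Python (hypothetical sizes; assumes search power scales linearly
    with the number of searched entries):

    TOTAL_BLOCKS = 64   # e.g. a 512k-entry TCAM in 64 blocks of 8k entries

    def relative_search_power(blocks_searched):
        """Fraction of full-search power, assuming linear scaling."""
        return blocks_searched / TOTAL_BLOCKS

    print(relative_search_power(4))   # 0.0625, i.e. a ~16x power reduction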

6
Bit Selection Architecture
  • Based on the observation that most prefixes in
    core routing tables are between 16 and 24 bits
    long
  • over 98% in the authors' datasets
  • Put the very short (<16-bit) and very long
    (>24-bit) prefixes in a set of TCAM blocks that
    are searched on every lookup
  • The remaining prefixes are partitioned into
    buckets, one of which is selected by hashing
    for each lookup
  • each bucket is laid out over one or more TCAM
    blocks
  • In this paper, the hashing function is restricted
    to merely using a selected set of input bits as
    an index (illustrated below)
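  • As an illustration (not the paper's code), extracting k selected
    bits of a 32-bit destination address to form a bucket index might
    look like this in Python; the positions come from the first 16
    bits of the address:

    def bucket_index(addr32, bit_positions):
        """Concatenate selected bits of a 32-bit address into an index.

        bit_positions: positions (0 = most significant bit) chosen from
        the first 16 bits; k = len(bit_positions) gives 2**k buckets.
        """
        index = 0
        for pos in bit_positions:
            bit = (addr32 >> (31 - pos)) & 1
            index = (index << 1) | bit
        return index

    # e.g. k = 3 hashing bits at positions 13, 14, 15 -> one of 8 buckets
    print(bucket_index(0xC0A80101, [13, 14, 15]))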

7
Bit Selection Architecture
8
Bit Selection Architecture
  • A route lookup, then, involves the following:
  • the hashing function (bit selection logic,
    really) selects k hashing bits from the
    destination address, which identify the bucket to
    be searched
  • also search the blocks with the very long and
    very short prefixes
  • The main issues now are:
  • how to select the k hashing bits
  • we restrict ourselves to choosing hashing bits
    from the first 16 bits of the address, to avoid
    replicating prefixes
  • how to allocate the different buckets among the
    various TCAM blocks (since bucket size may not be
    an integral multiple of the TCAM block size)

9
Bit Selection: Worst-case power consumption
  • Given any routing table containing N prefixes,
    each of length ≥ L, what is the size of the
    largest bucket generated by the best possible
    hash function that uses k bits out of the first
    L?
  • Theorem III.1: there exists some hash function
    splitting the set of prefixes such that the size
    of the largest bucket is bounded (formula not
    reproduced in this transcript)
  • more details and proof in Appendix I
  • an ideal hash function would generate 2^k
    equal-sized buckets
  • e.g. if k = 3, each bucket holds a 1/2^3 = 0.125
    fraction of the prefixes; if k = 6, each holds
    1/2^6 = 0.015625

10
Bit Selection Heuristics
  • We don't expect to see the worst-case input, but
    it gives designers a power budget
  • Given such a power budget and a routing table, it
    suffices to find a set of hashing bits that
    produces a split that does not exceed the power
    budget (a satisfying split)
  • Three heuristics:
  • the first is simple: use the rightmost k bits of
    the first 16 bits. In almost all routing traces
    studied, this works well.

11
Bit Selection Heuristics
  • Second heuristic: brute-force search checking all
    possible subsets of k bits from the first 16
  • guaranteed to find a satisfying split if one
    exists
  • since it examines C(16, k) possible sets of k
    bits, running time is maximal for k = 8

12
Bit Selection Heuristics
  • Third heuristic: a greedy algorithm
  • Falls between the simple heuristic and the
    brute-force one, in terms of complexity and
    accuracy
  • To select k hashing bits, the algorithm performs
    k iterations, selecting one bit per iteration
  • number of buckets doubles each iteration
  • Goal in each iteration is to select a bit that
    minimizes the size of the biggest bucket produced
    in that iteration

13
Bit Selection Heuristics
  • Third heuristic: greedy algorithm pseudocode
    (sketched below)
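  • The slide's pseudocode image is not in this transcript; a minimal
    Python sketch of the greedy bit selection (prefixes represented by
    their first 16 bits as integers; illustrative, not the paper's
    exact pseudocode):

    def greedy_select_bits(prefix_keys, k):
        """Pick k hashing bits one per iteration, each time choosing the
        bit that minimizes the size of the biggest resulting bucket."""
        chosen = []
        buckets = [list(prefix_keys)]        # bucket count doubles per round
        for _ in range(k):
            best = None                      # (worst_bucket, bit, split)
            for bit in range(16):
                if bit in chosen:
                    continue
                split = []
                for bucket in buckets:
                    ones = [p for p in bucket if (p >> (15 - bit)) & 1]
                    zeros = [p for p in bucket if not (p >> (15 - bit)) & 1]
                    split += [zeros, ones]
                worst = max(len(x) for x in split)
                if best is None or worst < best[0]:
                    best = (worst, bit, split)
            chosen.append(best[1])
            buckets = best[2]
        return chosen, max(len(x) for x in buckets)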

14
Bit Selection Heuristics
  • Combining the heuristics, to reduce running time
    (in typical cases):
  • first, try the simple heuristic (use the k
    rightmost bits), and stop if that succeeds
  • otherwise, apply the third heuristic (greedy
    algorithm), and stop if that succeeds
  • otherwise, apply the brute-force heuristic
  • Apply the algorithm again whenever route updates
    cause any bucket to become too large (a sketch of
    this fallback chain follows)
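  • A sketch of the fallback chain, building on the greedy sketch
    above; the budget is the largest bucket size the power budget
    allows (all names illustrative):

    from itertools import combinations

    def max_bucket(prefix_keys, bits):
        """Largest bucket produced by hashing on the given bit positions."""
        counts = {}
        for p in prefix_keys:
            key = tuple((p >> (15 - b)) & 1 for b in bits)
            counts[key] = counts.get(key, 0) + 1
        return max(counts.values(), default=0)

    def find_hashing_bits(prefix_keys, k, budget):
        simple = list(range(16 - k, 16))          # rightmost k of first 16
        if max_bucket(prefix_keys, simple) <= budget:
            return simple
        bits, worst = greedy_select_bits(prefix_keys, k)
        if worst <= budget:
            return bits
        for cand in combinations(range(16), k):   # brute force, C(16, k) sets
            if max_bucket(prefix_keys, list(cand)) <= budget:
                return list(cand)
        return None                               # no satisfying split exists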

15
Bit Selection Architecture: Experimental Results
  • Evaluate the heuristics with respect to two
    metrics: running time and quality of the splits
    produced
  • Applied to real core routing tables; results are
    presented for two, but others were similar
  • Applied to a synthetic table with 1M entries,
    constructed by randomly picking how many prefixes
    share each combination of the first 16 bits

16
Bit Selection Results: Running Time
  • Running time on an 800 MHz PC
  • Required less than 1 MB of memory

17
Bit Selection Results: Quality of Splits
  • let N denote the number of 16-24 bit prefixes
  • let cmax denote the maximum bucket size
  • The ratio N / cmax measures the quality
    (evenness) of the split produced by the hashing
    bits
  • it is the factor of reduction in the portion of
    the TCAM that needs to be searched

18
Bit Selection Architecture: Laying out TCAM
buckets
  • Blocks for very long prefixes and very short
    prefixes are placed in the TCAM at the beginning
    and end, respectively
  • ensures that we select the longest prefix, if
    more than one should match
  • Laying out buckets sequentially, in any order:
  • any bucket of size c occupies no more than
    ⌈c/s⌉ + 1 TCAM blocks, where s is the TCAM block
    size
  • e.g. with s = 8k, a bucket of c = 12k entries
    spans at most ⌈12k/8k⌉ + 1 = 3 blocks
  • at most ⌈cmax/s⌉ + 1 TCAM blocks need to be
    searched for any lookup (plus the blocks for very
    long and very short prefixes)
  • Thus, the actual power-savings ratio is not quite
    as good as the N / cmax mentioned before, but it
    is still good

19
Bit Selection Architecture: Remarks
  • Good average-case power reduction, but the
    worst-case bounds are not as good
  • hardware designers are thus forced to design for
    much higher power consumption than will be seen
    in practice
  • Assumes most prefixes are 16-24 bits long
  • may not always be the case (e.g. the number of
    long (>24-bit) prefixes may increase in the
    future)

20
Trie-based Table Partitioning
  • Partitioning scheme using a Routing Trie data
    structure
  • Eliminates the two drawbacks of the Bit Selection
    architecture
  • worst-case bounds on power consumption do not
    match well with power consumption in practice
  • assumption that most prefixes are 16-24 bits long
  • Two trie-based schemes (subtree-split and
    postorder-splitting), both involving two steps:
  • construct a binary routing trie from the routing
    table
  • partitioning step: carve out subtrees from the
    trie and place them into buckets
  • The two schemes differ in their partitioning step

21
Trie-based Architecture
  • Trie-based forwarding engine architecture
  • use an index TCAM (instead of hashing) to
    determine which bucket to search
  • requires searching the entire index TCAM, but
    typically the index TCAM is very small (see the
    sketch below)
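  • A schematic of the two-stage lookup in Python (a linear scan
    stands in for the TCAM hardware; all names are illustrative):

    def lpm(entries, addr, width=32):
        """Longest-prefix match by linear scan (software TCAM stand-in).
        entries: list of (prefix_bits, prefix_len, value) tuples."""
        best = None
        for pfx, plen, val in entries:
            if (addr >> (width - plen)) == (pfx >> (width - plen)):
                if best is None or plen > best[0]:
                    best = (plen, val)
        return None if best is None else best[1]

    def lookup(addr, index_tcam, data_buckets):
        """The small index TCAM picks a bucket; only that bucket (plus
        any always-searched blocks, omitted here) is then searched.
        Assumes the index always matches, e.g. via a default entry."""
        bucket_id = lpm(index_tcam, addr)
        return lpm(data_buckets[bucket_id], addr)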

22
Overview of Routing Tries
  • A 1-bit trie can be used for performing longest
    prefix matches (a minimal sketch follows below)
  • the trie consists of nodes, where a routing
    prefix of length n is stored at level n of the
    trie
  • Routing lookup process:
  • starts at the root
  • scans the input and descends left (right) if the
    next bit of input is 0 (1), until a leaf node is
    reached
  • the last prefix encountered is the longest
    matching prefix
  • count(v) = number of routing prefixes in the
    subtree rooted at v
  • the covering prefix of a node u is the prefix of
    the nearest ancestor of u (including u itself)
    that is in the routing table
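  • A minimal 1-bit trie in Python (illustrative, not the paper's
    code); prefixes are given as '0'/'1' strings:

    class TrieNode:
        def __init__(self):
            self.child = [None, None]   # 0 = left, 1 = right
            self.prefix = None          # routing prefix stored here, if any
            self.count = 0              # count(v): prefixes in this subtree

    def insert(root, bits):
        """Insert a prefix such as '1011'; updates count(v) on the path."""
        node = root
        node.count += 1
        for b in bits:
            i = int(b)
            if node.child[i] is None:
                node.child[i] = TrieNode()
            node = node.child[i]
            node.count += 1
        node.prefix = bits

    def longest_prefix_match(root, addr_bits):
        """Descend left/right by the address bits; the last prefix seen
        on the way down is the longest match."""
        node, best = root, root.prefix
        for b in addr_bits:
            node = node.child[int(b)]
            if node is None:
                break
            if node.prefix is not None:
                best = node.prefix
        return best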

23
Routing Trie Example
Routing Table
Corresponding 1-bit trie
24
Splitting into subtrees
  • Subtree-split algorithm
  • input: b = maximum size of a TCAM bucket
  • output: a set of K ≤ ⌈2N/b⌉ TCAM buckets, each
    with size in the range [⌈b/2⌉, b], and an index
    TCAM of size K
  • Partitioning step: post-order traversal of the
    trie, looking for carving nodes
  • carving node: a node v with ⌈b/2⌉ ≤ count(v) ≤ b
    and with a parent whose count is > b
  • When we find a carving node v:
  • carve out the subtree rooted at v and place it
    in a separate bucket
  • place the prefix of v in the index TCAM, along
    with the covering prefix of v
  • the counts of all ancestors of v are decreased
    by count(v)

25
Subtree-split Algorithm
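  • The algorithm figure is not in this transcript; a compact Python
    sketch of the carving pass (building on the TrieNode sketch above;
    covering-prefix bookkeeping elided):

    import math

    def subtree_split(root, b):
        """Post-order walk; carve any node whose remaining count reaches
        ceil(b/2), giving buckets of size in [ceil(b/2), b]. Whatever is
        left at the root becomes the final (possibly smaller) bucket."""
        half = math.ceil(b / 2)
        buckets, index = [], []

        def collect(node):
            """Remove and return all prefixes still in this subtree."""
            if node is None:
                return []
            got = collect(node.child[0]) + collect(node.child[1])
            if node.prefix is not None:
                got.append(node.prefix)
                node.prefix = None
            return got

        def walk(node, path):
            if node is None:
                return 0
            rem = (walk(node.child[0], path + "0")
                   + walk(node.child[1], path + "1")
                   + (1 if node.prefix is not None else 0))
            if rem >= half:                   # carving node
                index.append(path)            # its prefix enters the index TCAM
                buckets.append(collect(node))
                return 0
            return rem

        if walk(root, ""):                    # leftover prefixes: last bucket
            index.append("")                  # root entry matches everything
            buckets.append(collect(root))
        return buckets, index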
26
Subtree-split Example
b = 4
27
Subtree-split Example
b = 4
28
Subtree-split Example
b = 4
29
Subtree-split Example
b = 4
30
Subtree-split Remarks
  • Subtree-split creates buckets whose sizes range
    from ⌈b/2⌉ to b (except the last, which ranges
    from 1 to b)
  • At most one covering prefix is added to each
    bucket
  • The total number of buckets created ranges from
    ⌈N/b⌉ to ⌈2N/b⌉; each bucket results in one
    entry in the index TCAM
  • Using subtree-split in a TCAM with K buckets,
    during any lookup at most K + ⌈2N/K⌉ prefixes
    are searched from the index and data TCAMs
  • Total complexity of the subtree-split algorithm
    is O(N + NW/b), where W is the length of the
    longest prefix

31
Post-order splitting
  • Partitions the table into buckets of exactly b
    prefixes
  • an improvement over subtree-split, where the
    smallest and largest bucket sizes can vary by a
    factor of 2
  • this comes at the cost of more entries in the
    index TCAM
  • Partitioning step: post-order traversal of the
    trie, looking for subtrees to carve out, but
  • buckets are made from collections of subtrees,
    rather than just a single subtree
  • this is because the entire trie might not contain
    ⌈N/b⌉ subtrees of exactly b prefixes each

32
Post-order splitting
  • postorder-split does a post-order traversal of
    the trie, calling carve-exact to carve out
    subtree collections of size b
  • carve-exact does the actual carving:
  • if it is at a node with count = b, it can simply
    carve out that subtree
  • if it is at a node with count < b whose parent
    has count ≤ b, it does nothing (we will later
    have a chance to carve the parent)
  • if it is at a node with count x < b whose parent
    has count > b, then it
  • carves out the subtree of size x at this node,
    and
  • recursively calls carve-exact again, this time
    looking for a carving of size b - x (instead of
    b)

33
Post-order split Algorithm
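  • The algorithm figure is not in this transcript; a simplified
    Python sketch (using the TrieNode from the earlier sketch). It
    flattens carve-exact's recursion into a single running bucket,
    which yields the same packing under the post-order visit order;
    index-TCAM bookkeeping (up to W + 1 entries per bucket) is elided:

    def postorder_split(root, b):
        """Pack prefixes into buckets of exactly b in post-order; the
        last bucket may hold fewer."""
        buckets, current = [], []

        def walk(node):
            if node is None:
                return
            walk(node.child[0])
            walk(node.child[1])
            if node.prefix is not None:     # a node's prefix follows its subtree
                current.append(node.prefix)
                if len(current) == b:       # bucket full: seal it
                    buckets.append(current.copy())
                    current.clear()

        walk(root)
        if current:                          # final bucket of size 1..b
            buckets.append(current.copy())
        return buckets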
34
Post-order split Example
b = 4
35
Post-order split Example
b = 4
36
Post-order split Example
b = 4
37
Postorder-split Remarks
  • Postorder-split creates buckets of size exactly
    b (except the last, which ranges from 1 to b)
  • At most W covering prefixes are added to each
    bucket, where W is the length of the longest
    prefix in the table
  • The total number of buckets created is exactly
    ⌈N/b⌉; each bucket results in at most W + 1
    entries in the index TCAM
  • Using postorder-split in a TCAM with K buckets,
    during any lookup at most (W + 1)K + ⌈N/K⌉ + W
    prefixes are searched from the index and data
    TCAMs
  • Total complexity of the postorder-split algorithm
    is O(N + NW/b)

38
Post-order split: Experimental results
  • Algorithm running time

39
Post-order split: Experimental results
  • Reduction in routing table entries searched

40
Route Table Updates
  • Briefly explore performance in the face of
    routing table updates
  • Adding routes may cause a TCAM bucket to
    overflow, requiring repartitioning of the
    prefixes and rewriting the entire table into the
    TCAM
  • Apply real-life update traces (about 3.5M updates
    each) to the bit-selection and trie-based
    schemes, to see how often recomputation is needed

41
Route Table Updates
  • Bit-selection Architecture:
  • apply the brute-force heuristic to the initial
    table; note the size cmax of the largest bucket
  • recompute the hashing bits when any bucket grows
    beyond cthresh = (1 + t) × cmax, for some
    threshold t
  • when recomputing, first try the static heuristic;
    if needed, then try the greedy algorithm; fall
    back on brute force if necessary
  • Trie-based Architecture:
  • similar threshold-based strategy
  • subtree-split: use a bucket size of ⌈2N/K⌉
  • post-order splitting: use a bucket size of ⌈N/K⌉

42
Route Table Updates
  • Results for bit-selection architecture

43
Route Table Updates
  • Results for trie-based architecture

44
Route Table Updates: post-opt algorithm
  • Post-opt post-order split algorithm, with clever
    handling of updates
  • with post-order split, we can transfer prefixes
    between neighboring buckets easily (few writes to
    the index and data TCAMs are needed)
  • so, if a bucket becomes overfull, we can usually
    just transfer one of its prefixes to a
    neighboring bucket
  • repartitioning, then, is only needed when both
    neighboring buckets are also full (a sketch
    follows)
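  • A sketch of this overflow policy in Python (bucket structures and
    helper names are illustrative; in the real scheme the boundary
    prefix is the one moved, so the post-order layout is preserved):

    def insert_with_postopt(buckets, i, prefix, b, repartition):
        """Insert into bucket i (capacity b); on overflow, shift one
        boundary prefix to an adjacent bucket, else repartition."""
        buckets[i].append(prefix)
        if len(buckets[i]) <= b:
            return buckets
        for j in (i - 1, i + 1):             # neighbors in layout order
            if 0 <= j < len(buckets) and len(buckets[j]) < b:
                if j == i - 1:               # give our first prefix left
                    buckets[j].append(buckets[i].pop(0))
                else:                        # give our last prefix right
                    buckets[j].insert(0, buckets[i].pop())
                return buckets               # only a few TCAM writes
        return repartition(buckets)          # both neighbors full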

45
Summary
  • TCAMs would be great for routing lookup if they
    didn't use so much power
  • CoolCAMs: two architectures that use partitioned
    TCAMs to reduce power consumption in routing
    lookup
  • Bit-selection Architecture
  • Trie-based Table Partitioning (subtree-split and
    postorder-splitting)
  • each scheme has its own subtle
    advantages/disadvantages, but overall they seem
    to work well

46
Discussion