Title: Dynamic Pipelining: Making IP-Lookup Truly Scalable
Slide 1: Dynamic Pipelining: Making IP-Lookup Truly Scalable
- Jahangir Hasan, T. N. Vijaykumar
- Presented by Sailesh Kumar
Slide 2: A Simple Router
- At OC-768, IP lookup needs to complete in 2 ns and can become a bottleneck.
[Figure: arriving packets → IP lookup → VOQs → crossbar]
- The routing table contains (prefix, destination) pairs; IP lookup finds the destination with the longest matching prefix.
Slide 3: This Paper's Contribution
- This paper presents an IP-lookup ASIC architecture that addresses the following five scalability challenges:
  - Memory size: grows slowly with the number of prefixes
  - Lookup throughput: matches line rate
  - Implementation cost: complexity, chip area, etc.
  - Power dissipation: grows slowly with prefixes and line rate
  - Routing-table update cost: O(1)
- No existing lookup architecture effectively addresses all five challenges!
Slide 4: Previous Work
- Several IP-lookup schemes have been proposed.
- Memory access time > packet inter-arrival time, so pipelining must be used.
- Several papers have proposed pipelining:

  Scheme                                Space  Throughput  Updates  Power  Area
  TCAMs                                 Yes    Yes         Yes      -      -
  HLP (Sherwood, Varghese et al.,
  ISCA '03)                             -      Yes         Yes      -      -
  DLP (Basu & Narlikar, Infocom '05)    -      -           -        Yes    Yes
  This paper (SDP)                      Yes    Yes         Yes      Yes    Yes
Slide 5: IP Address Lookup
- Routing tables at router input ports contain (prefix, next hop) pairs.
- The address in the packet is compared to the stored prefixes, starting from the leftmost bit.
- The prefix that matches the largest number of address bits is the desired match.
- The packet is forwarded to the specified next hop.
Routing table:

  prefix       next hop
  10           7
  01           5
  110          3
  1011         5
  0001         0
  0101 1       7
  0001 0       1
  0011 00      2
  1011 001     3
  1011 010     5
  0100 110     6
  0100 1100    4
  1011 0011    8
  1001 1000    10
  0101 1001    9

Taken from CSE 577 Lecture Notes

Example address: 1011 0010 1000
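A minimal sketch of this rule over the slide's table (my own illustration, not the paper's code): scan all prefixes and keep the longest one that starts the address.

```python
# Minimal longest-prefix match over the slide's routing table.
# Prefixes and addresses are bit strings; the longest stored prefix
# that starts the address wins.

ROUTING_TABLE = {  # prefix -> next hop, copied from the slide
    "10": 7,        "01": 5,        "110": 3,       "1011": 5,
    "0001": 0,      "01011": 7,     "00010": 1,     "001100": 2,
    "1011001": 3,   "1011010": 5,   "0100110": 6,   "01001100": 4,
    "10110011": 8,  "10011000": 10, "01011001": 9,
}

def lookup(address):
    """Return the next hop of the longest matching prefix, or None."""
    best_prefix, best_hop = "", None
    for prefix, next_hop in ROUTING_TABLE.items():
        if address.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_hop = prefix, next_hop
    return best_hop

print(lookup("101100101000"))  # longest match is 1011 001 -> next hop 3
```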
Slide 6: Address Lookup Using Tries
- Prefixes are stored in alphabetical order in a tree.
- Prefixes are spelled out by following a path from the top; green dots mark prefix ends.
- To find the best prefix, spell out the address in the tree.
- The last green dot passed marks the longest matching prefix.
Example address: 1011 0010 1000
[Figure: binary trie over the routing table; spelling out the address, the last green dot passed is at 1011 001, next hop 3]
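A small binary-trie sketch of this procedure (again my own illustration, under hypothetical names):

```python
# A 1-bit (binary) trie: follow the address bit by bit, remembering the
# last node that ends a stored prefix (the slide's "green dot").

class Node:
    def __init__(self):
        self.child = {}        # '0' / '1' -> Node
        self.next_hop = None   # set iff a prefix ends at this node

def insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        node = node.child.setdefault(bit, Node())
    node.next_hop = next_hop

def longest_match(root, address):
    node, best = root, None
    for bit in address:
        node = node.child.get(bit)
        if node is None:
            break              # fell off the trie; best so far wins
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = Node()
for p, h in {"10": 7, "1011": 5, "1011001": 3}.items():
    insert(root, p, h)
print(longest_match(root, "101100101000"))  # -> 3
```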
Slide 7: Leaf Pushing
Leaf pushing: push P2 down to all the leaves it covers.
Routing table (prefix → next hop): 0 → P1, 1 → P2, 101 → P3
[Figure: binary trie before and after leaf pushing; P2 is replicated into the leaves it covers]
Without leaf pushing, every internal node might need to store next-hop information. Leaf pushing avoids explicit longest-prefix bookkeeping during lookup and, with proper encoding, also reduces node size. However, it complicates updates, since a single prefix change may require updating many leaves.
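The transformation can be sketched as one recursive pass over the binary trie from the previous sketch (an illustration assuming the Node class defined above):

```python
# Leaf pushing on the binary trie from the previous sketch: internal
# prefix nodes hand their next hop down, so after the pass only leaves
# carry next-hop information (replicating hops such as P2 on the slide).

def leaf_push(node, inherited=None):
    if node.next_hop is not None:
        inherited = node.next_hop          # most specific hop so far
    if not node.child:                     # leaf: keep the inherited hop
        node.next_hop = inherited
        return
    node.next_hop = None                   # internal nodes store nothing
    for bit in "01":
        if bit not in node.child and inherited is not None:
            node.child[bit] = Node()       # materialize the pushed leaf
        if bit in node.child:
            leaf_push(node.child[bit], inherited)

root = Node()
for p, h in {"0": "P1", "1": "P2", "101": "P3"}.items():
    insert(root, p, h)
leaf_push(root)   # leaves now read: 0 -> P1, 100 -> P2, 101 -> P3, 11 -> P2
```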
Slide 8: Multibit Trie
Example address: 101 100 101 000
- Match several bits in one step instead of a single bit.
- Equivalent to turning sub-trees of the binary trie into single nodes.
- Each node may be associated with several prefixes.
- For a stride of s, reduces the tree depth by a factor of s (lookup sketched below).
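A sketch of a fixed-stride lookup, using the stride-2 table that controlled prefix expansion (next slide) produces for the prefixes 0 → P1, 1 → P2, 101 → P3; the MNode layout is my own simplification:

```python
# Fixed-stride multibit trie: consume STRIDE address bits per step, so a
# W-bit lookup needs W/STRIDE memory accesses instead of W.

STRIDE = 2

class MNode:
    def __init__(self):
        # s-bit chunk -> (next_hop or None, child MNode or None)
        self.entry = {}

def lookup_multibit(root, address):
    node, best = root, None
    for i in range(0, len(address), STRIDE):
        hop, child = node.entry.get(address[i:i + STRIDE], (None, None))
        if hop is not None:
            best = hop
        if child is None:
            return best
        node = child
    return best

root, inner = MNode(), MNode()
root.entry = {"00": ("P1", None), "01": ("P1", None),
              "10": ("P2", inner), "11": ("P2", None)}
inner.entry = {"10": ("P3", None), "11": ("P3", None)}
print(lookup_multibit(root, "101100101000"))  # -> P3
```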
Slide 9: Controlled Prefix Expansion
- Some schemes use variable strides to improve the average case, but the worst case remains the same.

Routing table (prefix → next hop): 0 → P1, 1 → P2, 101 → P3
[Figure: stride-2 multibit trie built from the expanded prefixes]

- Controlled prefix expansion lengthens prefixes so that they align with the stride boundaries (sketched below).
- In the worst case, controlled prefix expansion causes non-deterministic increases in the routing-table size.
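A small sketch of the expansion itself (a hypothetical helper, not the paper's algorithm): each prefix is padded out to the next stride boundary, and longer prefixes override the expansions of shorter ones.

```python
# Controlled prefix expansion: pad each prefix to the next stride
# boundary; shorter prefixes are expanded first so that longer (more
# specific) prefixes override any overlapping expansions.

def expand(prefixes, stride):
    expanded = {}
    for prefix, hop in sorted(prefixes.items(), key=lambda kv: len(kv[0])):
        pad = (-len(prefix)) % stride      # bits to the next boundary
        tails = [format(i, f"0{pad}b") for i in range(2 ** pad)] if pad else [""]
        for tail in tails:
            expanded[prefix + tail] = hop
    return expanded

print(expand({"0": "P1", "1": "P2", "101": "P3"}, stride=2))
# {'00': 'P1', '01': 'P1', '10': 'P2', '11': 'P2', '1010': 'P3', '1011': 'P3'}
```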
Slide 10: The Need for Pipelined Tries
- Tomorrow's routers will run at 160 Gbps, i.e., 2 ns per packet.
- That allows at most one memory access per 2 ns (possibly fewer).
- Moreover, there may be millions of prefixes, so worst-case memory requirements are very high, and such large memories are slow.
- We need an architecture that uses multiple smaller memories and accesses them in a pipelined manner.
Slide 11: Pipelined Trie-Based IP-Lookup
Tree data structure with prefixes in the leaves (leaf pushing); process the IP address level by level to find the longest match.
[Figure: leaf-pushed trie holding prefixes P1-P7; e.g., P4 = 10010]
- Each level sits in a different stage → multiple packets overlap in the pipeline.
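To make the overlap concrete, here is a toy software model of level-per-stage pipelining (a sketch of the idea, not the paper's hardware), reusing Node and insert from the binary-trie sketch above: one new lookup enters per cycle while every in-flight lookup advances one level.

```python
# Toy model of level-per-stage pipelining: one lookup enters per cycle,
# and every in-flight lookup advances one trie level per cycle, so
# throughput is one lookup per cycle regardless of trie depth.

from collections import deque

def pipelined_lookups(root, addresses, depth):
    pending = deque(addresses)
    in_flight = deque()        # entries: (address, node, bits_used, best)
    results = {}
    while pending or in_flight:
        for _ in range(len(in_flight)):              # one pipeline tick
            addr, node, k, best = in_flight.popleft()
            child = node.child.get(addr[k]) if k < len(addr) else None
            if child is not None and child.next_hop is not None:
                best = child.next_hop                # passed a prefix end
            if child is None or k + 1 == depth:
                results[addr] = best                 # lookup retires
            else:
                in_flight.append((addr, child, k + 1, best))
        if pending:                                  # issue one new lookup
            in_flight.append((pending.popleft(), root, 0, None))
    return results

root = Node()
for p, h in {"10": 7, "1011": 5, "1011001": 3}.items():
    insert(root, p, h)
print(pipelined_lookups(root, ["101100101000", "10100"], depth=12))
# {'10100': 7, '101100101000': 3}
```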
Slide 12: Closest Previous Work
Data-Structure Level Pipelining (DLP): level-to-stage mapping
- Maps each trie level to a pipeline stage, but the mapping is static.
- Updates change the prefix distribution, yet the mapping persists.
[Figure: a skewed trie for prefixes 0 → P1, 00 → P2, 000 → P3, ...; all prefixes crowd into a few levels]
In the worst case, any single stage can hold all the prefixes, so each stage needs a large worst-case memory. There is no bound on the worst-case update cost → it could be made O(1) using Tree Bitmap, but the constant is huge: 1852 memory accesses per update (SIGCOMM Computer Communication Review '04).
Figure taken from Hasan et al.
Slide 13: Memory Bound per Stage
- The figure below shows the worst-case prefix distribution: 1 million prefixes, each 32 bits long.
- In this case:
  - the largest stage is 5 MB;
  - the total memory size is 80 MB, as opposed to the 6 MB of total prefix size.
- Moreover, a 5 MB memory can't be accessed faster than about 6 ns.
Figure taken from Hasan et al.
Slide 14: Hardware-Level Pipelining (HLP)
- HLP pipelines memory accesses at the hardware level: multiple words of memory are read together in a pipelined manner.
- Throughput is limited only by the memory-array access time.
- Such memories can improve IP-lookup throughput, but the approach as such is not scalable: a higher degree of pipelining leads to prohibitive chip area and power dissipation.
Figure taken from Sherwood et al.
Slide 15: Key Idea
- HLP doesn't scale well in chip area and power.
- DLP scales well in power but doesn't scale well in:
  - memory size (due to the static level-to-stage mapping);
  - throughput, as one large stage can't cycle faster than about 6 ns.
- So combine the two (SDP):
  - use DLP, but with a better mapping so that each stage is smaller;
  - use HLP at every stage to accelerate it further.
Slide 16: Key Idea: Use Dynamic Mapping
- Map node height to stage (instead of level to stage).
- Height changes with updates and captures the distribution of prefixes below a node, hence the name dynamic mapping (see the sketch after the figure).
[Figure: the skewed trie of slide 12 remapped by node height, spreading nodes evenly across stages]
- However, the worst-case total memory requirement remains the same, i.e., when all prefixes are 32 bits long.
Figure taken from Hasan et al.
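A sketch of the mapping itself (my own unoptimized illustration, reusing the Node class from the earlier sketch; I assume the slides' convention that a leaf has height 1):

```python
# SDP's dynamic mapping: a node of height h goes to pipeline stage W - h,
# so the mapping tracks the current shape of the trie, not its levels.
# Heights are recomputed naively here for clarity.

W = 32  # address width = number of pipeline stages

def height(node):
    if not node.child:
        return 1               # leaves count as height 1 in this sketch
    return 1 + max(height(c) for c in node.child.values())

def assign_stages(root):
    """Return {stage index: [nodes]} under the height-to-stage mapping."""
    stages = {}
    todo = [root]
    while todo:
        node = todo.pop()
        stages.setdefault(W - height(node), []).append(node)
        todo.extend(node.child.values())
    return stages
```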
Slide 17: Key Idea: Use Jump Nodes
- Use jump nodes:
  - so that the worst-case memory requirement can be reduced;
  - they also restore the relation between height and prefix distribution.
[Figure: the single-child chain between prefixes ..1 (P4) and ..1010 (P5) collapsed into a node "Jump 010"]
- One could argue that jump nodes would reduce the memory requirements of DLP too. No: we will soon see why!
Figure taken from Hasan et al.
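Jump nodes are essentially path compression; a sketch under my own representation (the branching bit stays on the edge, and any further single-child run becomes the node's jump string), not the paper's exact node layout:

```python
# Path compression with jump nodes: every run of single-child trie nodes
# collapses into one JNode whose `jump` string must be matched before
# branching.

class JNode:
    def __init__(self, jump=""):
        self.jump = jump       # bits to consume before branching
        self.child = {}        # '0'/'1' -> JNode
        self.next_hop = None

def compress(node):
    """Convert a binary-trie Node (earlier sketch) into a jump-node trie."""
    jump = ""
    while len(node.child) == 1 and node.next_hop is None:
        bit, only_child = next(iter(node.child.items()))
        jump += bit            # absorb the single-child step
        node = only_child
    out = JNode(jump)
    out.next_hop = node.next_hop
    for bit, child in node.child.items():
        out.child[bit] = compress(child)
    return out

# A lone deep prefix becomes a single jump node instead of a 7-node chain:
root = Node()
insert(root, "1011001", 3)
print(compress(root).jump)  # '1011001'
```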
Slide 18: Another Example of Jump Nodes
- Note that this trie needs more than one node operation per table update, contrary to what the paper claims!
[Figure: an example trie transformed by adding jump nodes, then by leaf pushing]
Slide 19: Tries with Jump Nodes
Key properties:
(1) Number of leaves = number of prefixes → no replication; avoids the inflation of prefix expansion and leaf pushing.
(2) Updates do not propagate to subtrees → no replication.
(3) Every internal node has 2 children → jump nodes collapse away single-child nodes.
Slide 20: Total versus Per-Stage Memory
- Jump nodes bound the total size by 2N nodes.
- Would DLP + jump nodes give small per-stage memory?
[Figure: a trie on N prefixes split into its top log2(N) levels and the remaining W - log2(N) levels]
No: DLP is still a static mapping → large worst-case per-stage memory. The total is bounded, but the per-stage size is not.
Figure taken from Hasan et al.
Slide 21: SDP's Per-Stage Memory Bound
- Proposition: map all nodes of height h to the (W - h)-th pipeline stage.
- Result: size of the k-th stage ≤ min(N / (W - k), 2^k).
Slide 22: Key Observation 1
- A node of height h has at least h prefixes in its subtree:
  - there is at least one path of length h from the node to some leaf, with h - 1 nodes along the path;
  - each of those nodes leads to at least 1 leaf off the path;
  - so the path accounts for (h - 1) + 1 = h leaves → h prefixes.
Figure taken from Hasan et al.
Slide 23: Key Observation 2
- There are no more than N / h nodes of height h, for any prefix distribution:
  - assume more than N / h nodes of height h;
  - each accounts for at least h prefixes (Observation 1);
  - the total number of prefixes would then exceed N;
  - by contradiction, Observation 2 is true.
Slide 24: Main Result of the Proposition
- Map all nodes of height h to the (W - h)-th pipeline stage.
- The k-th stage has at most N / (W - k) nodes, from Observation 2.
- The 1-bit trie has binary fanout → at most 2^k nodes in the k-th stage.
- Size of the k-th stage ≤ min(N / (W - k), 2^k) nodes.
[Figure: per-stage memory of dynamic pipelining (SDP) versus static pipelining (DLP)]
- Results in 20 MB for 1 million prefixes, 4x better than DLP.
Figure taken from Hasan et al.
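Plugging numbers into the bound reproduces the slide's figure; the sketch below sums min(N / (W - k), 2^k) over the stages (the 6-byte node size is my assumption, chosen only to land near the quoted 20 MB):

```python
# Evaluate SDP's per-stage bound: stage k holds at most
# min(N / (W - k), 2^k) nodes (k = 0 .. W-1, leaves in the last stage).

N, W = 10**6, 32
NODE_BYTES = 6   # assumed fixed node size; the paper's exact value may differ

per_stage = [min(N // (W - k), 2 ** k) for k in range(W)]
total_nodes = sum(per_stage)                      # ~3.4 million nodes
total_mb = total_nodes * NODE_BYTES / 2**20
print(f"{total_nodes:,} nodes, about {total_mb:.1f} MB")  # ≈ 20 MB
```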
Slide 25: Optimum Incremental Updates
- Does 1 update change the height, and hence the stage, of many nodes? Would migrating all affected nodes make updates inefficient?
- No: not many nodes need to move, as only the ancestors' heights can be affected.
- Each ancestor is in a different stage → 1 node-write in each stage → 1 write bubble for any update.
- Updating SDP is not just O(1) but exactly 1 write bubble.
Figure taken from Hasan et al.
Slide 26: Incremental Updates
[Figure: an example trie with nodes numbered 1-17 mapped across pipeline stages Pipe 0 through Pipe 5, before and after an update]
Slide 27: Incremental Updates (continued)
- The implementation complexity may be rather high, since jump nodes might have to be computed on the fly (e.g., for node 7).
[Figure: the same trie after the update; node 7 becomes a jump node ("7, Jump")]
Slide 28: Efficient Memory Management
- Tree Bitmap with segmented hole compaction requires multiple memory accesses per update.
- A multibit trie with variable strides requires even more complex memory management.
- SDP: no variable striding or compression → all nodes have the same size → no fragmentation or compaction upon updates (see the allocator sketch below).
- Memory management is trivial and has zero fragmentation.
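With fixed-size nodes, per-stage allocation reduces to a free list; a minimal sketch (my own illustration) of why there is nothing to compact:

```python
# Each stage's memory is a pool of equal-size slots managed by a free
# list: O(1) allocate, O(1) free, and no holes to compact.

class NodePool:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.free = list(range(capacity))  # indices of unused slots

    def alloc(self, node):
        idx = self.free.pop()              # O(1); raises if pool is full
        self.slots[idx] = node
        return idx

    def free_slot(self, idx):
        self.slots[idx] = None
        self.free.append(idx)              # O(1); slot is reusable as-is
```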
Slide 29: Scaling SDP for Throughput
- Each SDP stage can be further pipelined in hardware.
- HLP (ISCA '03) pipelined only in hardware, without DLP: too deep at high line rates.
- Combine HLP and SDP for a feasibly deep hardware pipeline.
[Figure: number of HLP stages per SDP stage, for stage sizes from 2^k up to N / (W - k)]
- Throughput matches future line rates.
Figure taken from Hasan et al.
Slide 30: Experiments
Figure taken from Hasan et al.
Slide 31: Experiments
Figure taken from Hasan et al.
Slide 32: Experiments
Figure taken from Hasan et al.
Slide 33: Discussion / Questions