Title: Dynamic Pipelining: Making IP-Lookup Truly Scalable
Slide 1: Dynamic Pipelining: Making IP-Lookup Truly Scalable
- Jahangir Hasan, T. N. Vijaykumar
- Presented by Sailesh Kumar
Slide 2: A Simple Router
- At OC-768, IP lookup needs to complete in 2 ns and can become a bottleneck.
[Figure: arriving packets → IP lookup → VOQs → crossbar]
- The routing table contains (prefix, destination) pairs; IP lookup finds the destination with the longest matching prefix.
Slide 3: This Paper's Contribution
- This paper presents an IP-lookup ASIC architecture that addresses the following five scalability challenges:
  - Memory size: grows slowly with the number of prefixes
  - Lookup throughput: matches line rate
  - Implementation cost: complexity, chip area, etc.
  - Power dissipation: grows slowly with prefixes and line rate
  - Routing-table update cost: O(1)
- No existing lookup architecture effectively addresses all five challenges!
Slide 4: Previous Work
- Several IP-lookup schemes have been proposed.
- Memory access time > packet inter-arrival time, so pipelining must be used.
- Several papers have proposed pipelining:

  Scheme                                Space  Throughput  Updates  Power  Area
  TCAMs                                 Yes    Yes         Yes      -      -
  HLP (Sherwood, Varghese et al.,
  ISCA '03)                             -      Yes         Yes      -      -
  DLP (Basu & Narlikar, Infocom '05)    -      -           -        Yes    Yes
  This paper (SDP)                      Yes    Yes         Yes      Yes    Yes
Slide 5: IP Address Lookup
- Routing tables at router input ports contain (prefix, next hop) pairs.
- The address in the packet is compared to the stored prefixes, starting from the leftmost bit.
- The prefix that matches the largest number of address bits is the desired match.
- The packet is forwarded to the specified next hop.
Routing table:

  prefix       next hop
  10           7
  01           5
  110          3
  1011         5
  0001         0
  0101 1       7
  0001 0       1
  0011 00      2
  1011 001     3
  1011 010     5
  0100 110     6
  0100 1100    4
  1011 0011    8
  1001 1000    10
  0101 1001    9

Taken from CSE 577 Lecture Notes

Example address: 1011 0010 1000
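A minimal sketch of this rule over the slide's table (my own illustration, not the paper's code): scan all prefixes and keep the longest one that starts the address.

```python
# Minimal longest-prefix match over the slide's routing table.
# Prefixes and addresses are bit strings; the longest stored prefix
# that starts the address wins.

ROUTING_TABLE = {  # prefix -> next hop, copied from the slide
    "10": 7,        "01": 5,        "110": 3,       "1011": 5,
    "0001": 0,      "01011": 7,     "00010": 1,     "001100": 2,
    "1011001": 3,   "1011010": 5,   "0100110": 6,   "01001100": 4,
    "10110011": 8,  "10011000": 10, "01011001": 9,
}

def lookup(address):
    """Return the next hop of the longest matching prefix, or None."""
    best_prefix, best_hop = "", None
    for prefix, next_hop in ROUTING_TABLE.items():
        if address.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_hop = prefix, next_hop
    return best_hop

print(lookup("101100101000"))  # longest match is 1011 001 -> next hop 3
```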
Slide 6: Address Lookup Using Tries
- Prefixes are stored in alphabetical order in a tree.
- Prefixes are spelled out by following a path from the top; green dots mark prefix ends.
- To find the best prefix, spell out the address in the tree.
- The last green dot passed marks the longest matching prefix.
Example address: 1011 0010 1000
[Figure: binary trie over the routing table; spelling out the address, the last green dot passed is at 1011 001, next hop 3]
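A small binary-trie sketch of this procedure (again my own illustration, under hypothetical names):

```python
# A 1-bit (binary) trie: follow the address bit by bit, remembering the
# last node that ends a stored prefix (the slide's "green dot").

class Node:
    def __init__(self):
        self.child = {}        # '0' / '1' -> Node
        self.next_hop = None   # set iff a prefix ends at this node

def insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        node = node.child.setdefault(bit, Node())
    node.next_hop = next_hop

def longest_match(root, address):
    node, best = root, None
    for bit in address:
        node = node.child.get(bit)
        if node is None:
            break              # fell off the trie; best so far wins
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = Node()
for p, h in {"10": 7, "1011": 5, "1011001": 3}.items():
    insert(root, p, h)
print(longest_match(root, "101100101000"))  # -> 3
```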
Slide 7: Leaf Pushing
Leaf pushing: push P2 down to all the leaves it covers.
Routing table (prefix → next hop): 0 → P1, 1 → P2, 101 → P3
[Figure: binary trie before and after leaf pushing; P2 is replicated into the leaves it covers]
Without leaf pushing, every internal node might need to store next-hop information. Leaf pushing avoids explicit longest-prefix bookkeeping during lookup and, with proper encoding, also reduces node size. However, it complicates updates, since a single prefix change may require updating many leaves.
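The transformation can be sketched as one recursive pass over the binary trie from the previous sketch (an illustration assuming the Node class defined above):

```python
# Leaf pushing on the binary trie from the previous sketch: internal
# prefix nodes hand their next hop down, so after the pass only leaves
# carry next-hop information (replicating hops such as P2 on the slide).

def leaf_push(node, inherited=None):
    if node.next_hop is not None:
        inherited = node.next_hop          # most specific hop so far
    if not node.child:                     # leaf: keep the inherited hop
        node.next_hop = inherited
        return
    node.next_hop = None                   # internal nodes store nothing
    for bit in "01":
        if bit not in node.child and inherited is not None:
            node.child[bit] = Node()       # materialize the pushed leaf
        if bit in node.child:
            leaf_push(node.child[bit], inherited)

root = Node()
for p, h in {"0": "P1", "1": "P2", "101": "P3"}.items():
    insert(root, p, h)
leaf_push(root)   # leaves now read: 0 -> P1, 100 -> P2, 101 -> P3, 11 -> P2
```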
Slide 8: Multibit Trie
Example address: 101 100 101 000
- Match several bits in one step instead of a single bit.
- Equivalent to turning sub-trees of the binary trie into single nodes.
- Each node may be associated with several prefixes.
- For a stride of s, reduces the tree depth by a factor of s (lookup sketched below).
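A sketch of a fixed-stride lookup, using the stride-2 table that controlled prefix expansion (next slide) produces for the prefixes 0 → P1, 1 → P2, 101 → P3; the MNode layout is my own simplification:

```python
# Fixed-stride multibit trie: consume STRIDE address bits per step, so a
# W-bit lookup needs W/STRIDE memory accesses instead of W.

STRIDE = 2

class MNode:
    def __init__(self):
        # s-bit chunk -> (next_hop or None, child MNode or None)
        self.entry = {}

def lookup_multibit(root, address):
    node, best = root, None
    for i in range(0, len(address), STRIDE):
        hop, child = node.entry.get(address[i:i + STRIDE], (None, None))
        if hop is not None:
            best = hop
        if child is None:
            return best
        node = child
    return best

root, inner = MNode(), MNode()
root.entry = {"00": ("P1", None), "01": ("P1", None),
              "10": ("P2", inner), "11": ("P2", None)}
inner.entry = {"10": ("P3", None), "11": ("P3", None)}
print(lookup_multibit(root, "101100101000"))  # -> P3
```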
Slide 9: Controlled Prefix Expansion
- Some schemes use variable strides to improve the average case, but the worst case remains the same.

Routing table (prefix → next hop): 0 → P1, 1 → P2, 101 → P3
[Figure: stride-2 multibit trie built from the expanded prefixes]

- Controlled prefix expansion lengthens prefixes so that they align with the stride boundaries (sketched below).
- In the worst case, controlled prefix expansion causes non-deterministic increases in the routing-table size.
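A small sketch of the expansion itself (a hypothetical helper, not the paper's algorithm): each prefix is padded out to the next stride boundary, and longer prefixes override the expansions of shorter ones.

```python
# Controlled prefix expansion: pad each prefix to the next stride
# boundary; shorter prefixes are expanded first so that longer (more
# specific) prefixes override any overlapping expansions.

def expand(prefixes, stride):
    expanded = {}
    for prefix, hop in sorted(prefixes.items(), key=lambda kv: len(kv[0])):
        pad = (-len(prefix)) % stride      # bits to the next boundary
        tails = [format(i, f"0{pad}b") for i in range(2 ** pad)] if pad else [""]
        for tail in tails:
            expanded[prefix + tail] = hop
    return expanded

print(expand({"0": "P1", "1": "P2", "101": "P3"}, stride=2))
# {'00': 'P1', '01': 'P1', '10': 'P2', '11': 'P2', '1010': 'P3', '1011': 'P3'}
```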
Slide 10: The Need for Pipelined Tries
- Tomorrow's routers will run at 160 Gbps, i.e., 2 ns per packet.
- That allows at most one memory access per 2 ns (possibly fewer).
- Moreover, there may be millions of prefixes, so worst-case memory requirements are very high, and such large memories are slow.
- We need an architecture that uses multiple smaller memories and accesses them in a pipelined manner.
Slide 11: Pipelined Trie-Based IP-Lookup
Tree data structure with prefixes in the leaves (leaf pushing); process the IP address level by level to find the longest match.
[Figure: leaf-pushed trie holding prefixes P1-P7; e.g., P4 = 10010]
- Each level sits in a different stage → multiple packets overlap in the pipeline.
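To make the overlap concrete, here is a toy software model of level-per-stage pipelining (a sketch of the idea, not the paper's hardware), reusing Node and insert from the binary-trie sketch above: one new lookup enters per cycle while every in-flight lookup advances one level.

```python
# Toy model of level-per-stage pipelining: one lookup enters per cycle,
# and every in-flight lookup advances one trie level per cycle, so
# throughput is one lookup per cycle regardless of trie depth.

from collections import deque

def pipelined_lookups(root, addresses, depth):
    pending = deque(addresses)
    in_flight = deque()        # entries: (address, node, bits_used, best)
    results = {}
    while pending or in_flight:
        for _ in range(len(in_flight)):              # one pipeline tick
            addr, node, k, best = in_flight.popleft()
            child = node.child.get(addr[k]) if k < len(addr) else None
            if child is not None and child.next_hop is not None:
                best = child.next_hop                # passed a prefix end
            if child is None or k + 1 == depth:
                results[addr] = best                 # lookup retires
            else:
                in_flight.append((addr, child, k + 1, best))
        if pending:                                  # issue one new lookup
            in_flight.append((pending.popleft(), root, 0, None))
    return results

root = Node()
for p, h in {"10": 7, "1011": 5, "1011001": 3}.items():
    insert(root, p, h)
print(pipelined_lookups(root, ["101100101000", "10100"], depth=12))
# {'10100': 7, '101100101000': 3}
```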
Slide 12: Closest Previous Work
Data-Structure Level Pipelining (DLP): level-to-stage mapping
- Maps each trie level to a pipeline stage, but the mapping is static.
- Updates change the prefix distribution, yet the mapping persists.
[Figure: a skewed trie for prefixes 0 → P1, 00 → P2, 000 → P3, ...; all prefixes crowd into a few levels]
In the worst case, any single stage can hold all the prefixes, so each stage needs a large worst-case memory. There is no bound on the worst-case update cost → it could be made O(1) using Tree Bitmap, but the constant is huge: 1852 memory accesses per update (SIGCOMM Computer Communication Review '04).
Figure taken from Hasan et al.
Slide 13: Memory Bound per Stage
- The figure below shows the worst-case prefix distribution: 1 million prefixes, each 32 bits long.
- In this case:
  - the largest stage is 5 MB;
  - the total memory size is 80 MB, as opposed to the 6 MB of total prefix size.
- Moreover, a 5 MB memory can't be accessed faster than about 6 ns.
Figure taken from Hasan et al.
Slide 14: Hardware-Level Pipelining (HLP)
- HLP pipelines memory accesses at the hardware level: multiple words of memory are read together in a pipelined manner.
- Throughput is limited only by the memory-array access time.
- Such memories can improve IP-lookup throughput, but the approach as such is not scalable: a higher degree of pipelining leads to prohibitive chip area and power dissipation.
Figure taken from Sherwood et al.
Slide 15: Key Idea
- HLP doesn't scale well in chip area and power.
- DLP scales well in power but doesn't scale well in:
  - memory size (due to the static level-to-stage mapping);
  - throughput, as one large stage can't cycle faster than about 6 ns.
- So combine the two (SDP):
  - use DLP, but with a better mapping so that each stage is smaller;
  - use HLP at every stage to accelerate it further.
Slide 16: Key Idea: Use Dynamic Mapping
- Map node height to stage (instead of level to stage).
- Height changes with updates and captures the distribution of prefixes below a node, hence the name dynamic mapping (see the sketch after the figure).
[Figure: the skewed trie of slide 12 remapped by node height, spreading nodes evenly across stages]
- However, the worst-case total memory requirement remains the same, i.e., when all prefixes are 32 bits long.
Figure taken from Hasan et al.
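A sketch of the mapping itself (my own unoptimized illustration, reusing the Node class from the earlier sketch; I assume the slides' convention that a leaf has height 1):

```python
# SDP's dynamic mapping: a node of height h goes to pipeline stage W - h,
# so the mapping tracks the current shape of the trie, not its levels.
# Heights are recomputed naively here for clarity.

W = 32  # address width = number of pipeline stages

def height(node):
    if not node.child:
        return 1               # leaves count as height 1 in this sketch
    return 1 + max(height(c) for c in node.child.values())

def assign_stages(root):
    """Return {stage index: [nodes]} under the height-to-stage mapping."""
    stages = {}
    todo = [root]
    while todo:
        node = todo.pop()
        stages.setdefault(W - height(node), []).append(node)
        todo.extend(node.child.values())
    return stages
```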
Slide 17: Key Idea: Use Jump Nodes
- Use jump nodes:
  - so that the worst-case memory requirement can be reduced;
  - they also restore the relation between height and prefix distribution.
[Figure: the single-child chain between prefixes ..1 (P4) and ..1010 (P5) collapsed into a node "Jump 010"]
- One could argue that jump nodes would reduce the memory requirements of DLP too. No: we will soon see why!
Figure taken from Hasan et al.
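Jump nodes are essentially path compression; a sketch under my own representation (the branching bit stays on the edge, and any further single-child run becomes the node's jump string), not the paper's exact node layout:

```python
# Path compression with jump nodes: every run of single-child trie nodes
# collapses into one JNode whose `jump` string must be matched before
# branching.

class JNode:
    def __init__(self, jump=""):
        self.jump = jump       # bits to consume before branching
        self.child = {}        # '0'/'1' -> JNode
        self.next_hop = None

def compress(node):
    """Convert a binary-trie Node (earlier sketch) into a jump-node trie."""
    jump = ""
    while len(node.child) == 1 and node.next_hop is None:
        bit, only_child = next(iter(node.child.items()))
        jump += bit            # absorb the single-child step
        node = only_child
    out = JNode(jump)
    out.next_hop = node.next_hop
    for bit, child in node.child.items():
        out.child[bit] = compress(child)
    return out

# A lone deep prefix becomes a single jump node instead of a 7-node chain:
root = Node()
insert(root, "1011001", 3)
print(compress(root).jump)  # '1011001'
```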
Slide 18: Another Example of Jump Nodes
- Note that this trie needs more than one node operation per table update, contrary to what the paper claims!
[Figure: an example trie transformed by adding jump nodes, then by leaf pushing]
Slide 19: Tries with Jump Nodes
Key properties:
(1) Number of leaves = number of prefixes → no replication; avoids the inflation of prefix expansion and leaf pushing.
(2) Updates do not propagate to subtrees → no replication.
(3) Every internal node has 2 children → jump nodes collapse away single-child nodes.
Slide 20: Total versus Per-Stage Memory
- Jump nodes bound the total size by 2N nodes.
- Would DLP + jump nodes give small per-stage memory?
[Figure: a trie on N prefixes split into its top log2(N) levels and the remaining W - log2(N) levels]
No: DLP is still a static mapping → large worst-case per-stage memory. The total is bounded, but the per-stage size is not.
Figure taken from Hasan et al.
Slide 21: SDP's Per-Stage Memory Bound
- Proposition: map all nodes of height h to the (W - h)-th pipeline stage.
- Result: size of the k-th stage ≤ min(N / (W - k), 2^k).
Slide 22: Key Observation 1
- A node of height h has at least h prefixes in its subtree:
  - there is at least one path of length h from the node to some leaf, with h - 1 nodes along the path;
  - each of those nodes leads to at least 1 leaf off the path;
  - so the path accounts for (h - 1) + 1 = h leaves → h prefixes.
Figure taken from Hasan et al.
Slide 23: Key Observation 2
- There are no more than N / h nodes of height h, for any prefix distribution:
  - assume more than N / h nodes of height h;
  - each accounts for at least h prefixes (Observation 1);
  - the total number of prefixes would then exceed N;
  - by contradiction, Observation 2 is true.
Slide 24: Main Result of the Proposition
- Map all nodes of height h to the (W - h)-th pipeline stage.
- The k-th stage has at most N / (W - k) nodes, from Observation 2.
- The 1-bit trie has binary fanout → at most 2^k nodes in the k-th stage.
- Size of the k-th stage ≤ min(N / (W - k), 2^k) nodes.
[Figure: per-stage memory of dynamic pipelining (SDP) versus static pipelining (DLP)]
- Results in 20 MB for 1 million prefixes, 4x better than DLP.
Figure taken from Hasan et al.
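Plugging numbers into the bound reproduces the slide's figure; the sketch below sums min(N / (W - k), 2^k) over the stages (the 6-byte node size is my assumption, chosen only to land near the quoted 20 MB):

```python
# Evaluate SDP's per-stage bound: stage k holds at most
# min(N / (W - k), 2^k) nodes (k = 0 .. W-1, leaves in the last stage).

N, W = 10**6, 32
NODE_BYTES = 6   # assumed fixed node size; the paper's exact value may differ

per_stage = [min(N // (W - k), 2 ** k) for k in range(W)]
total_nodes = sum(per_stage)                      # ~3.4 million nodes
total_mb = total_nodes * NODE_BYTES / 2**20
print(f"{total_nodes:,} nodes, about {total_mb:.1f} MB")  # ≈ 20 MB
```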
Slide 25: Optimum Incremental Updates
- Does 1 update change the height, and hence the stage, of many nodes? Would migrating all affected nodes make updates inefficient?
- No: not many nodes need to move, as only the ancestors' heights can be affected.
- Each ancestor is in a different stage → 1 node-write in each stage → 1 write bubble for any update.
- Updating SDP is not just O(1) but exactly 1 write bubble.
Figure taken from Hasan et al.
Slide 26: Incremental Updates
[Figure: an example trie with nodes numbered 1-17 mapped across pipeline stages Pipe 0 through Pipe 5, before and after an update]
Slide 27: Incremental Updates (continued)
- The implementation complexity may be rather high, since jump nodes might have to be computed on the fly (e.g., for node 7).
[Figure: the same trie after the update; node 7 becomes a jump node ("7, Jump")]
Slide 28: Efficient Memory Management
- Tree Bitmap with segmented hole compaction requires multiple memory accesses per update.
- A multibit trie with variable strides requires even more complex memory management.
- SDP: no variable striding or compression → all nodes have the same size → no fragmentation or compaction upon updates (see the allocator sketch below).
- Memory management is trivial and has zero fragmentation.
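With fixed-size nodes, per-stage allocation reduces to a free list; a minimal sketch (my own illustration) of why there is nothing to compact:

```python
# Each stage's memory is a pool of equal-size slots managed by a free
# list: O(1) allocate, O(1) free, and no holes to compact.

class NodePool:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.free = list(range(capacity))  # indices of unused slots

    def alloc(self, node):
        idx = self.free.pop()              # O(1); raises if pool is full
        self.slots[idx] = node
        return idx

    def free_slot(self, idx):
        self.slots[idx] = None
        self.free.append(idx)              # O(1); slot is reusable as-is
```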
Slide 29: Scaling SDP for Throughput
- Each SDP stage can be further pipelined in hardware.
- HLP (ISCA '03) pipelined only in hardware, without DLP: too deep at high line rates.
- Combine HLP and SDP for a feasibly deep hardware pipeline.
[Figure: number of HLP stages per SDP stage, for stage sizes from 2^k up to N / (W - k)]
- Throughput matches future line rates.
Figure taken from Hasan et al.
Slide 30: Experiments
Figure taken from Hasan et al.
Slide 31: Experiments
Figure taken from Hasan et al.
Slide 32: Experiments
Figure taken from Hasan et al.
Slide 33: Discussion / Questions