Title: Fifteen Implementation Principles
1Fifteen Implementation Principles
- CS 685 Network Algorithmics
- Spring 2006
2Taxonomy of Principles
- P1-P5 System-oriented Principles
- These recognize/leverage the fact that a system
is made up of components - Basic idea move the problem to somebody elses
subsystem - P6-P10 Improve efficiency without destroying
modularity - Pushing the envelope of module specifications
- Basic engineering system should satisfy spec but
not do more - P11-P15 Local optimization techniques
- Akin to peephole optimizations in compilers
- Apply these after you have looked at the big
picture
3Part I Systems Principles
4P1 Avoid Obvious Waste
- Key Concept look for ways to avoid doing
something, or to avoid doing it multiple times - Example copying in protocol stacks
- In the old days, copying was not a bottleneck
- But transmission processor speeds increased
much faster than memory speeds - Today, reading/writing from/to memory is slowest
operation on a packet - Zero-copy protocol stacks never copy packet
data once it is stored in memory by the NIC - Eventually multiple copies became a bottleneck
5P2 Shift Computation in Time
- Key Concept Move expensive operations from where
time is scarce to where it is more plentiful - P2a Precompute ( shift earlier in time)
- Example computing a function by table lookup
- P2b Evaluate Lazily ( shift later in time)
- Example copy-on-write
- P2c Share Expenses ( collect things in time)
- Example garbage collection
- Examples
- initializing counter arrays lazy evaluation
batching
6P3 Relax (Sub)System Requirements
- Key Concept make one subsystems job easier by
relaxing the specification (possibly at the
expense of another subsystems job getting
slightly harder) - P3a Trade certainty for time ( use
probabilistic solutions) - P3b Trade accuracy for time ( use approximate
solutions) - Remember Good enough is good enough
- P3c Shift computation in space ( let someone
else do it) - Example DF bit in IPv4
- Purpose to help End Systems avoid fragmentation
- Why avoid?
- ease load on routers, avoid loss of APDUs
7P4 Leverage Off-System Components
- Key Concept design bottom-up, use hardware
- P4a Exploit locality ( exploit caches)
- Note not always useful (cf. IP forwarding
lookups) - P4b Trade Memory for Speed
- Note this can be interpreted two ways
- Use more memory to avoid computation (cf P12)
- Use less memory to make data structures fit in
cache - P4c Exploit hardware features
- Examples
- Some NIC cards compute TCP checksum on the board
(i.e. the OS/software does not need to compute it)
8P5 Add Hardware to Improve Performance
- Key Concept Hardware is inherently parallel, can
be very fast, and is cheap in volume consider
adding hardware to perform operations that must
be done often and are expensive in software - P5a Use memory interleaving, pipelining (
parallelism) - P5b Use Wide-word parallelism (save memory
accesses) - P5c Combine SRAM, DRAM
- Examples
- DES encryption requires permuting bits,
nonlinear mappings (some say it was designed to
require hardware for fast implementation) - Add-on boards for encryption/decryption at wire
speeds - Ephemeral State Store
9Ephemeral State Store(Associative Memory)
Implementation
Last
k
Handle
value
Next
Hash Table (SRAM)
Handle Table (DRAM)
Value Table (DRAM)
(tag, value)
A store of size 2k bindings requires 128 (h1)k
z bits where h (hash table size / store
size) and z timestamp size
10Part II Modularity With Efficiency
- Note read Clark Tennenhouse, Architectural
Considerations for a New Generation of
Protocols, Proceedings of ACM SIGCOMM 1990
11P6 Replace inefficient general routines with
efficient specialized ones
- Key Concept General-purpose routines cannot
leverage problem-specific knowledge that can
improve performance - Example
- in_pcblookup() function in BSD Unix
general-purpose state-block retrieval for both
TCP and UDP sockets - Uses simple linear search
- Not adequate for large servers with 10000s of
open sockets - Applications have very different reference
characteristics, so organize PCBs for different
apps separately (Calvert Dixon '95)
12P7 Avoid Unnecessary Generality
- Key Concept Do not include features or handle
cases that are not needed. - "When in doubt, leave it out"
- Example
- mbuf structure was designed to not waste memory
for devices that produced small amounts of input - Today memory is cheap, and most devices produce
1000 bytes at once - Most implementations use large monolithic
buffers, big enough to hold an Ethernet packet
(2K bytes or more)
13P8 Don't be tied to reference implementations
- Key Concept
- Implementations are sometimes given (e.g. by
manufacturers) as a way to make the specification
of an interface precise, or show how to use a
device - These do not necessarily show the right way to
think about the problemthey are chosen for
conceptual clarity! - Examples
- RSARef implementation of RSA cryptography
- Thread-per-layer implementations vs. upcalls
14P9 Pass hints across interfaces
- Key Concept if the caller knows something the
callee will have to compute, pass it (or
something that makes it easier to compute) as an
argument! - "hint" something that makes the recipient's
life easier, but may not be correct - "tip" hint that is guaranteed to be correct
- Caveat callee must either trust caller, or
verify (probably should do both) - Example
- Passing addresses of device-mapped pages in from
user processes (Text Section 4.1)
15P10 Pass hints in protocol headers
- Key Concept If sender knows something receiver
will have to compute, pass it in the header - Example
- Identifying state blocks
- TCP/IP method (4-tuple) is inefficient and slow
- Better have each end give the other a handle, to
be included in each packet handle can be index
into array - Include a nonce or key to validate the
information - Nonce stored in both header and state block
16Part III Local Speedup Techniques
17P11 Optimize the Expected Case
- Key Concept If 80 of the cases can be handled
similarly, optimize for those cases - P11a Use Caches
- A form of using state to improve performance
- Example
- TCP input "header prediction"
- If an incoming packet is in order and does what
is expected, can process in small number of
instructions (see code)
18P12 Add or Exploit State to Gain Speed
- Key Concept Remember things to make it easier
to compute them later - P12a Compute incrementally
- Here the idea is to "accumulate" as you go,
rather than computing all-at-once at the end - Example
- Incremental computation of Chi-Square statistic
19P13 Optimize Degrees of Freedom
- Key Concept Consider all the aspects of the
problem that might be adapted to the conditions - Example IP trie lookups, where "width" of tree
varies in the tree
20P14 Use special techniques for finite universes
(e.g. small integers)
- Key Concept when the domain of a function is
small, techniques like bucket sorting, bitmaps,
etc. become feasible. - Example Timing wheels
21P15 Use algorithmic techniques to create
efficient data structures
- Key Concept once P1-P14 have been applied, think
about how to build an ingenious data structure
that exploits what you know - Examples
- IP forwarding lookups
- PATRICIA trees (data structure) were first
- Then many other more-efficient approaches
- Packet classification
- Given a set of patterns to match 5-tuples, and a
5-tuple, find allthe first pattern(s) that it
matches
22Caveats
- These are implementation principles, not design
principles - But when you go to design new protocols, it is
very helpful to know them! (E.g. SCTP uses
receiver-chosen handles to identify state
blocks.) - There are other (probably better) ways to carve
up these ideas into groups of "Principles" - The value is in thinking about and applying them
23Cautionary Examples
- Having web servers pre-serve embedded objects
when an HTML object is requested - Varghese says they tried it, found it hurt
performance! - Two proposed reasons
- Interaction with TCP slow-start
- Client caching
- Multiple-string matching in Snort (IDS)
- Modified Boyer-Moore matching algorithm to do
"set matches" - Incorporated into Snort little improvement
- Why?
- String matching was not the bottleneck!
- A large data structure was required, which no
longer fit in cache
24Cautionary Examples, cont.
- Process-list searching in PDP-11 Unix
- Many kernel operations involved a linear search
through the list of processes - Idea use doubly-linked list of processes to
speed search, insertion, deletion - Result would've taken about twice as long for
typical table sizes needed 1000's of processes
before it would pay off!
25Cautionary Questions
- Q1 Is improvement really needed?
- Q2 Is this really the bottleneck?
- Q3 What impact will change have on rest of
system? - Q4 Does BoE-analysis indicate significant
improvement? - Q5 Is it worth adding custom hardware?
- Q6 Can protocol change be avoided?
- Q7 Do prototypes confirm the initial promise?
- Q8 Will performance gains be lost if environment
changes?
26CANEs Packet-Processing Model
Generic Forwarding Function
predefined slots
customizing code
(e.g. active congestion control)
outgoing channels
incoming channels
27Experiment Configuration
Background traffic source
Active IP router
Bottleneck link (2 Mbps)
MPEG source (avg rate 725 kbps)