Title: The Peregrine High-Performance RPC System
1 The Peregrine High-Performance RPC System
- David B. Johnson and
- Willy Zwaenepoel
- Department of Computer Science
- Rice University
- Presented by Khaled Elmeleegy
- Assisted by Moez Abdel-gawad
2 Overview
- Peregrine is an RPC system.
- It aims to optimize RPC performance.
- The paper supports its design with experimental results.
3 Key optimizations
- No intermediate copies of arguments or results.
- No data conversion between client and server (unless needed).
- Preallocated and precomputed header templates for transmitted packets (sketched below).
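A minimal sketch, in C, of what a precomputed header template could look like; the struct layout, field names, and the prepare_call_header helper are illustrative assumptions, not Peregrine's actual packet format. Everything constant for a given binding is filled in once, so each call only updates the few per-call fields.

    /* Hypothetical header-template sketch: field layout is an assumption. */
    #include <stdint.h>

    struct rpc_header {
        uint8_t  eth_dst[6];    /* server Ethernet address: fixed per binding */
        uint8_t  eth_src[6];    /* client Ethernet address: fixed */
        uint16_t eth_type;      /* RPC protocol type: fixed */
        uint32_t binding_id;    /* established at bind time: fixed */
        uint32_t procedure;     /* remote procedure number: varies per call */
        uint32_t call_seq;      /* sequence number: varies per call */
        uint32_t data_len;      /* argument length: varies per call */
    };

    static struct rpc_header header_template;   /* computed once, at bind time */

    /* Per-call work touches only the variable fields. */
    static void prepare_call_header(struct rpc_header *hdr,
                                    uint32_t proc, uint32_t seq, uint32_t len)
    {
        *hdr = header_template;    /* copy the precomputed constant part */
        hdr->procedure = proc;
        hdr->call_seq  = seq;
        hdr->data_len  = len;
    }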
4 Key optimizations (continued)
- No thread-specific state is saved between calls in the server.
- Arguments are mapped into the server's address space rather than being copied.
- For multi-packet arguments, most copying overlaps with transmission of the next packet.
5 Implementation
[Figure: RPC in Peregrine. Call path: application local call, client stub, trap, kernel transmit (gather DMA); server kernel receive, reinitialize server thread, jump to server stub, call and execute the remote procedure, return, transmit the return message (DMA); client kernel receive, return, local return.]
6 Implementation (cont'd)
- In Peregrine, the kernel is responsible for:
  1. Getting RPC messages from one address space to another (usually from one machine to another).
  2. Reinitializing a free thread in the server when a call message arrives; that thread handles the call, including the binding (see the sketch after this list).
  3. Unblocking the client thread when the return message arrives.
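A rough sketch, in C, of the kernel-side dispatch described in point 2; the types and helper names (thread_pool_get, start_thread_at, server_stub_entry) are hypothetical, not the Peregrine kernel's API.

    /* Hypothetical sketch of dispatching an arriving call message. */
    struct thread;
    struct call_msg { unsigned procedure; void *args; unsigned len; };

    extern struct thread *thread_pool_get(void);              /* a free server thread */
    extern void start_thread_at(struct thread *t,
                                void (*entry)(struct call_msg *),
                                struct call_msg *msg);
    extern void server_stub_entry(struct call_msg *msg);      /* generated server stub */

    void kernel_handle_call(struct call_msg *msg)
    {
        struct thread *t = thread_pool_get();
        if (t != NULL) {
            /* Reinitialize the free thread and start it in the stub;
             * the stub locates the procedure named in the message. */
            start_thread_at(t, server_stub_entry, msg);
        }
        /* Queueing the call when no thread is free is omitted here. */
    }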
7 Implementation (cont'd)
- Unlike the previous paper:
  - There is no RPCRuntime; instead, it is the kernel's responsibility to transfer messages reliably.
  - A pool of threads is used instead of a pool of processes, which gives better performance.
  - All processing specific to the particular server procedure being called is performed in the stubs.
8 Hardware Requirements
- The Peregrine implementation utilizes:
  - The ability to remap memory pages between address spaces by manipulating the page-table entries (see the sketch below).
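A conceptual sketch, in C, of moving a page between address spaces by editing page-table entries instead of copying the data; pte_of, the PTE layout, and remap_page are illustrative assumptions, not a real MMU interface.

    /* Hypothetical page-remapping sketch: the PTE format is an assumption. */
    #include <stdint.h>

    typedef uint32_t pte_t;                               /* frame number + flags */

    extern pte_t *pte_of(void *addr_space, void *vaddr);  /* hypothetical PTE lookup */

    #define PTE_FRAME_MASK  0xFFFFF000u
    #define PTE_PRESENT     0x00000001u

    /* Make 'dst_va' in 'dst_as' refer to the physical frame that
     * currently backs 'src_va' in 'src_as'; the data is never copied. */
    void remap_page(void *src_as, void *src_va, void *dst_as, void *dst_va)
    {
        pte_t *src = pte_of(src_as, src_va);
        pte_t *dst = pte_of(dst_as, dst_va);

        *dst = (*src & PTE_FRAME_MASK) | PTE_PRESENT;
        /* A real kernel would also flush the TLB entry for dst_va. */
    }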
9 Hardware Requirements (cont'd)
- The gather DMA capability of the Ethernet controller (sketched after the figure below).
[Figure: Gather DMA transmission. Argument pages (P1-P9) are picked up directly from the client's address space, sent over the network, and delivered into the server's address space.]
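A minimal sketch, in C, of how gather DMA lets the call packet leave the machine without an intermediate copy: the controller is handed a list of (address, length) pairs covering the header template and the argument words still sitting in the client's pages. The dma_segment layout and dma_transmit call are assumptions, not a real Ethernet controller interface.

    /* Hypothetical gather-DMA transmit sketch. */
    #include <stddef.h>

    struct dma_segment {
        const void *addr;    /* start of this piece of the packet */
        size_t      len;     /* number of bytes to take from it */
    };

    extern void dma_transmit(const struct dma_segment *segs, int nsegs); /* assumed */

    void send_call_packet(const void *header, size_t hdr_len,
                          const void *args,   size_t arg_len)
    {
        struct dma_segment segs[2] = {
            { header, hdr_len },   /* precomputed header template */
            { args,   arg_len },   /* arguments, taken directly from client pages */
        };
        dma_transmit(segs, 2);     /* one packet on the wire, no intermediate copy */
    }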
10 Implementation of the optimizations
- Gather DMA is used to send arguments/results, instead of expensive copying.
- No data conversion (unless needed).
- Use of packet header templates.
- The received packet is mapped into the thread's stack to avoid copying (sketched below).
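A conceptual sketch, in C, of the receive-side mapping: the Ethernet receive buffer page holding the call packet is remapped to the top of the server thread's stack, so the arguments already lie where a local call would have placed them. The thread layout, the 4 KB page size, and the remap_page helper (from the earlier sketch) are assumptions.

    /* Hypothetical sketch: map the received arguments onto the thread's stack. */
    #include <stddef.h>

    extern void remap_page(void *src_as, void *src_va, void *dst_as, void *dst_va);
    extern void *kernel_as;                        /* kernel address space (assumed) */

    struct thread { void *addr_space; char *stack_top; };

    /* Returns where the server stub will find the arguments; no copy is made. */
    void *map_args_onto_stack(struct thread *t, void *rx_buffer_page,
                              size_t args_offset)
    {
        char *stack_page = t->stack_top - 4096;    /* page becoming the new stack top */
        remap_page(kernel_as, rx_buffer_page, t->addr_space, stack_page);
        return stack_page + args_offset;
    }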
11 Implementation of the optimizations (cont'd)
[Figure: Received call packet in one of the server's Ethernet receive buffer pages.]
12 Used optimization techniques (cont'd)
- The server thread doesn't save or restore its registers between different RPCs, since reentering the stub is a jump, not a call (sketched below).
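A small sketch, in C, of the "jump, not call" idea: after a call completes, the thread does not unwind a call chain; it simply jumps back to the stub entry point with a fresh stack, so nothing from the previous call needs to be saved or restored. setjmp/longjmp stand in here for the kernel's reinitialize-and-jump; wait_for_call and run_procedure are hypothetical.

    /* Illustrative only: setjmp/longjmp model the reinitialize-and-jump. */
    #include <setjmp.h>

    static jmp_buf stub_entry;                 /* recorded once per server thread */

    extern void wait_for_call(void);           /* hypothetical: block for the next call */
    extern void run_procedure(void);           /* hypothetical: invoke the bound procedure */

    void server_thread_start(void)
    {
        setjmp(stub_entry);                    /* the point every call (re)starts from */
        wait_for_call();
        run_procedure();
        /* Jump rather than return: the old stack contents are simply
         * abandoned, so no registers are saved or restored between calls. */
        longjmp(stub_entry, 1);
    }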
13 Multi-Packet Network RPC
- When a network RPC message containing the argument or result values is larger than the data portion of a single Ethernet packet, the message is broken into multiple packets.
- As in the single-packet case, the data are transmitted directly from the client's address space using gather DMA to avoid copying.
14 Multi-Packet Network RPC (cont'd)
- Other than packet transmission, the execution of a multi-packet network RPC is the same as for the single-packet case (receive-side sketch below).
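A minimal sketch, in C, of the receive side of a multi-packet call, under the assumption that later packets are copied into place while the controller is already receiving the following packet; wait_for_packet is a hypothetical helper, not Peregrine's actual interface.

    /* Hypothetical multi-packet receive loop. */
    #include <stddef.h>
    #include <string.h>

    extern void *wait_for_packet(size_t *payload_len);   /* next filled receive buffer */

    void receive_multi_packet_args(char *arg_area, size_t total_len)
    {
        size_t done = 0;
        while (done < total_len) {
            size_t len;
            void *payload = wait_for_packet(&len);
            /* The controller keeps receiving the following packet while
             * this memcpy runs, so most copying overlaps reception. */
            memcpy(arg_area + done, payload, len);
            done += len;
        }
    }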
15 Multi-Packet Network RPC (cont'd)
[Figure: Example multi-packet call transmission and reception.]
16 Local RPC
- Occurs between two threads executing on the same machine.
- Memory mapping is used to move the call arguments and results between the client's and server's address spaces (sketched below).
- Otherwise, the execution is the same as for a network RPC.
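A thin sketch, in C, of the local case, reusing the hypothetical remap_page helper from the earlier sketch: for a same-machine call, the kernel remaps the page holding the arguments from the client's address space into the server's instead of driving the network.

    /* Hypothetical local-RPC argument transfer by remapping. */
    extern void remap_page(void *src_as, void *src_va, void *dst_as, void *dst_va);

    void local_rpc_move_args(void *client_as, void *client_args_page,
                             void *server_as, void *server_args_page)
    {
        /* One page-table update in place of copying up to a page of
         * argument data; results move back the same way on return. */
        remap_page(client_as, client_args_page, server_as, server_args_page);
    }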
17 Performance Numbers
[Table: Peregrine RPC performance for single-packet network RPCs (microseconds).]
18 Performance Numbers (cont'd)
[Table: Peregrine RPC performance for multi-packet network RPCs.]
19 Effectiveness of the Optimizations
- Not copying memory for either the arguments or the results was shown to be a very effective optimization.
- For multi-packet RPCs, avoiding copying on the critical path saved significant time as well.
- Not performing data representation conversion when it is not needed was yet another effective optimization.
20 Conclusion
- Peregrine, by
  - avoiding expensive copies,
  - avoiding expensive data representation conversions,
  - avoiding recomputation of packet headers, and
  - reducing thread-management overhead,
- achieves performance that is very close to the hardware latency, for both network RPCs and local RPCs.