Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Inter-process Communication (III)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Berkeley Sockets (I)
- Socket primitives for TCP/IP.
3. Berkeley Sockets (II)
- Connection-oriented communication pattern using sockets.
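The connection-oriented pattern can be sketched on the loopback interface as follows. This is a minimal illustration, not the code from the slides: the function names `serve_once` and `echo_roundtrip` are made up for the example, and the server handles a single client.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Server side: accept one connection and echo data back until EOF."""
    conn, _addr = listener.accept()            # accept: wait for a client
    with conn:
        while True:
            data = conn.recv(4096)             # receive
            if not data:                       # empty bytes: client finished sending
                break
            conn.sendall(data)                 # send


def echo_roundtrip(payload: bytes) -> bytes:
    """Start a one-shot echo server on 127.0.0.1 and exchange one payload."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket
    listener.bind(("127.0.0.1", 0))            # bind: port 0 lets the OS pick a port
    listener.listen(1)                         # listen
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # socket
    client.connect(("127.0.0.1", port))        # connect
    client.sendall(payload)                    # send
    client.shutdown(socket.SHUT_WR)            # signal "no more data" to the server
    chunks = []
    while True:
        chunk = client.recv(4096)              # receive until the server closes
        if not chunk:
            break
        chunks.append(chunk)
    client.close()                             # close
    server.join()
    listener.close()
    return b"".join(chunks)
```

The server follows socket, bind, listen, accept, receive/send, close; the client follows socket, connect, send/receive, close.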
4. Connected vs Connectionless (I)
- IP: best-effort, unreliable, connectionless
  - Remembers nothing about a packet after it has sent it
  - Checksum computed on header only
  - No assumptions about the underlying physical medium
    - Serial link, Ethernet, Token ring, X.25, ATM, wireless CDPD, ...
- UDP
  - (optional) checksum
  - notion of port
5. Connected vs Connectionless (II)
- TCP: reliable connection-oriented service
  - Segments are sent in IP datagrams
  - Checksum of data in each segment
  - Sequence number of the 1st byte in the segment
  - Acknowledge-and-retransmit mechanism
- Each side maintains a receive window
  - Range of sequence numbers that this side is prepared to receive
  - Any arriving data with sequence numbers outside the receive window is discarded
  - Queuing of data arriving out-of-order
  - Window slides to the right, if the next expected sequence number has arrived
    - and an ACK is sent back with the sequence number expected next
- Send window
  - Bytes sent but not yet acknowledged
    - RTO timer (retransmission timeout)
    - Timeout does not always mean that the data was lost !!
  - Bytes that can be sent but have not yet been sent
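The receive-window rules above can be illustrated with a small model. This is only a sketch under simplifying assumptions: the class `ReceiveWindow` is hypothetical, sequence numbers do not wrap around, and in-order data is handed to the application immediately, so the window size never shrinks.

```python
class ReceiveWindow:
    """Toy model of a TCP receive window (byte-granular, no wrap-around)."""

    def __init__(self, next_expected: int, size: int) -> None:
        self.next_expected = next_expected   # left edge of the window
        self.size = size                     # number of sequence numbers accepted
        self.pending = {}                    # seq -> byte, queued out-of-order data
        self.delivered = bytearray()         # bytes handed over in order

    def on_segment(self, seq: int, data: bytes) -> int:
        """Process a segment whose 1st byte has number `seq`; return the ACK."""
        right_edge = self.next_expected + self.size
        for offset, byte in enumerate(data):
            number = seq + offset
            if self.next_expected <= number < right_edge:
                self.pending[number] = byte  # inside the window: queue it
            # outside the window: silently discarded
        # Slide the window to the right while the next expected byte is present.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return self.next_expected            # ACK carries the number expected next
```

For example, a segment starting at 103 is queued while 100 is still expected; once the segment starting at 100 arrives, the window slides past both.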
6. UDP Failure Model
- Omission failures
  - timeouts
  - duplicate messages
  - lost messages
- Need to maintain history
  - Last reply sent to each client
    - provided that a client can make only one request at a time
    - interprets each request as the ACK for the previous reply
  - periodic purge of history
    - No ACK for the last response received before client terminates
- Fixed max. buffer size (8 KB)
- No message order guarantee
- Process crash failures
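The history mechanism can be sketched without any networking. The class `ReplyHistory`, its method names, and the explicit `now` argument are assumptions made for this illustration; a real server would key the table by the client's address and use a clock.

```python
class ReplyHistory:
    """Server-side history: the last reply sent to each client."""

    def __init__(self, handler):
        self.handler = handler      # the actual request-processing function
        self.last = {}              # client -> (request_id, reply, time stored)
        self.executions = 0         # how many times the handler really ran

    def handle(self, client, request_id, payload, now):
        entry = self.last.get(client)
        if entry is not None and entry[0] == request_id:
            # Duplicate (retransmitted) request: resend the stored reply
            # instead of executing the operation a second time.
            return entry[1]
        # A new request id acts as the ACK for the previous reply,
        # so the old entry is simply overwritten.
        reply = self.handler(payload)
        self.executions += 1
        self.last[client] = (request_id, reply, now)
        return reply

    def purge(self, now, max_age):
        """Periodic purge: drop replies that will never be acknowledged."""
        stale = [c for c, (_, _, stored) in self.last.items()
                 if now - stored > max_age]
        for client in stale:
            del self.last[client]
```

The purge is needed because the last reply a client receives before it terminates is never followed by another request, so nothing else would remove it.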
7. TCP Failure Model
- Reliable message delivery
  - checksums, sequence numbers, timeouts
  - no need for applications to deal with
    - retransmissions
    - duplicates
    - reordering
  - no need for histories
- Flow control mechanism
  - large transfers without overwhelming the receiver
- BUT not reliable sessions
  - Connections may be severed or severely congested
  - Processes cannot distinguish network from process failure
  - Processes cannot tell if their recent messages were received
8. TCP is a stream protocol
- No inherent notion of message boundary
  - The amount of data in a packet is not directly related to the amount of data delivered to TCP in the send() call
  - No reliable way for the receiver to determine how the data was packetized
    - Several packets may have arrived between recv() calls
  - The amount of data returned in any given read() is unpredictable
- Fixed-length messages
- Variable-length messages
  - End-of-record marker
  - Fixed-length header (including record length) + variable data
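The last option, a fixed-length header carrying the record length, might look like the sketch below. The 4-byte header size and the helper names `send_record`, `recv_exact` and `recv_record` are choices made for this example, not something prescribed by the slides.

```python
import socket
import struct

HEADER = struct.Struct("!I")   # 4-byte unsigned record length, network byte order


def send_record(sock: socket.socket, payload: bytes) -> None:
    """Send one record: fixed-length header followed by variable data."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Keep calling recv() until exactly `count` bytes have arrived."""
    buffer = bytearray()
    while len(buffer) < count:
        chunk = sock.recv(count - len(buffer))   # may return fewer bytes than asked
        if not chunk:
            raise ConnectionError("peer closed the connection mid-record")
        buffer.extend(chunk)
    return bytes(buffer)


def recv_record(sock: socket.socket) -> bytes:
    """Receive one record, regardless of how TCP packetized it."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

The loop in `recv_exact` is the important part: a single recv() may return part of a record or, with a larger buffer, several records at once.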
9. TCP Failure Modes (I)
- TCP guarantees delivery of the data it sends
- True or False ?
False. How can we handle outages and crashes ?
Guarantee to whom ?
10. TCP Failure Modes (II)
- IP is a best-effort, unreliable protocol
  - so the TCP layer is the first place in the data path where it makes sense to even talk about guarantees
- The sender's TCP layer can make no guarantee about segments that arrive at the receiver's TCP layer
  - An arriving segment may be corrupted, or it may contain duplicate data, or it may be out of order
- The receiver's TCP layer guarantees to the sender's TCP layer that any segment that it ACKs, and all data that came before it, have been correctly received
  - This does not mean that the data has been delivered to the application, or that it will ever be delivered !!
  - For example, the receiving host may crash after the ACK but before delivery
11. TCP Failure Modes (III)
- It also makes sense to talk about guarantees at application B (receiver)
  - There can be no guarantee that all data sent by application A will arrive
  - However, all data that does arrive will be in order and uncorrupted
Avoid the attitude that TCP will take care of everything
TCP is an end-to-end protocol, providing a reliable transport mechanism between peers
The peers are the TCP layers of the sender and the receiver !!
12. TCP Failure Modes (IV)
- Explicit acknowledgements
  - What does the client do if the server does not ACK receipt ??
  - It may not be safe to simply resend a request
When a problem occurs at an endpoint, there is generally no alternative path: the problem persists until it is repaired
An intermediate router may send the originator an ICMP message indicating that the destination network or the host is unreachable
OR the sender eventually times out and resends the segments not ACKed. This continues until the sender gives up and drops the connection (9 minutes). Pending read -> ETIMEDOUT. Otherwise, the next write fails -> SIGPIPE or EPIPE
13. TCP Failure Modes (V)
- Peer crash
  - Indistinguishable from the case of the peer calling close() and then exit()
  - The peer's TCP layer issues a FIN segment
    - This only tells the receiver that the peer will send no more data; it does not necessarily imply that the peer has exited, or even that it is not willing to receive more data
  - Reception of the FIN may come at different execution states of the application
    - If client is blocked, TCP has no way of notifying it
      - The next transmission generates a RST segment -> ECONNRESET
      - If the RST is ignored and more data is transmitted -> SIGPIPE
      - This may occur if the client performs two or more consecutive write() calls without an intervening read() -> Notification takes place only after the 2nd write()
    - If client has a pending read(), it gets an immediate error indication (e.g. read() returns EOF)
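The "read() returns EOF" case can be observed directly on a loopback connection. The sketch below is an illustration only; `tcp_pair` and `peer_has_closed` are hypothetical helper names. In Python, EOF shows up as recv() returning an empty bytes object.

```python
import socket


def tcp_pair():
    """Return two connected TCP sockets (client, server) on the loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server, _addr = listener.accept()
    listener.close()
    return client, server


def peer_has_closed(sock: socket.socket, timeout: float = 1.0) -> bool:
    """True if the peer's FIN has arrived and no unread data remains."""
    sock.settimeout(timeout)
    try:
        # MSG_PEEK looks at incoming data without consuming it;
        # an empty result means end-of-file, i.e. a FIN was received.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except socket.timeout:
        return False        # nothing arrived: peer is silent, not closed
```

Note that this only detects an orderly close or a crashed process whose host is still up; a silent peer looks the same whether it is idle, its host crashed, or the network is down.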
14. TCP Failure Modes (VI)
- Peer's host crash
  - The peer's TCP cannot issue the FIN segment
  - Until recovery, this case cannot be distinguished from a network outage
  - The peer's TCP no longer responds, but the sender keeps retransmitting
    - Until either the host recovers, or the sender gives up the connection -> ETIMEDOUT
  - If the host reboots before the sender gives up, a retransmitted segment may arrive at the TCP layer without it having knowledge of the connection -> RST
    - If sender has a read() pending -> ECONNRESET
    - Else, the next write() results in a SIGPIPE signal
15. Behavior of Peers
- Checking for client termination
  - Heartbeats, timeouts for read operations, SO_KEEPALIVE option, ...
- Checking for valid input
  - Buffer overflow errors
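Two of the termination checks listed above, the SO_KEEPALIVE option and a timeout on read operations, can be sketched as follows. The helper names `enable_keepalive` and `read_with_deadline` are made up for this example; the keep-alive probe intervals are left at the operating system defaults.

```python
import socket


def enable_keepalive(sock: socket.socket) -> None:
    """Ask the TCP layer to probe an idle connection periodically."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


def read_with_deadline(sock: socket.socket, nbytes: int, seconds: float):
    """Return received data, or None if nothing arrives within `seconds`.

    A None result is how an application-level heartbeat check would
    notice a missed heartbeat. An empty bytes result means the peer closed.
    """
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None
```

A read timeout gives the application control over how quickly it reacts, whereas keep-alive probing typically starts only after a long idle period.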
16. We rely on DNS
17. The Message-Passing Interface
- Some of the most intuitive primitives of MPI.
18. Group Communication
- Multicasting: 1-to-many comm. pattern
- Applications
  - replicated services (better fault tolerance)
  - discovery of services
  - replicated data (better performance)
  - propagation of event notifications
- Failure model
  - depends on implementation
  - IP multicast (UDP datagrams): omission failures
    - class-D Inet addresses: 1110 bit prefix
    - TTL
  - reliable multicast
  - ordered multicast
    - FIFO
    - Causal
    - Total
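Of the three orderings, FIFO is the simplest to sketch: messages from the same sender are delivered in the order they were sent. The class below is a hypothetical illustration that assumes each sender numbers its messages 0, 1, 2, ... and that lost messages are eventually retransmitted by some reliable-multicast layer not shown here.

```python
class FifoReceiver:
    """Hold-back queue that delivers each sender's messages in send order."""

    def __init__(self):
        self.next_seq = {}     # sender -> next sequence number to deliver
        self.held = {}         # (sender, seq) -> message held back
        self.delivered = []    # messages passed to the application, in order

    def receive(self, sender, seq, message):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return             # duplicate of an already delivered message
        self.held[(sender, seq)] = message
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected
```

FIFO ordering says nothing about the relative order of messages from different senders; causal and total ordering need additional information such as vector timestamps or a sequencer.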
19. Conventional Procedure Call
- Parameter passing in a local procedure call: the stack before the call to read
- The stack while the called procedure is active
20. Software layers
RPC is more than a (transport) protocol: a structuring mechanism for distributed systems
21. Steps of a Remote Procedure Call
- Client procedure calls client stub in normal way
- Client stub builds message, calls local OS
- Client's OS sends message to remote OS
- Remote OS gives message to server stub
- Server stub unpacks parameters, calls server
- Server does work, returns result to the stub
- Server stub packs it in message, calls local OS
- Server's OS sends message to client's OS
- Client's OS gives message to client stub
- Stub unpacks result, returns to client
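The ten steps can be traced in a single-process sketch. Everything here is an assumption made for illustration: JSON stands in for the message format, the function `transport` stands in for both operating systems and the network, and the stub names are hypothetical.

```python
import json


def square(x):
    """The server procedure that does the actual work."""
    return x * x


PROCEDURES = {"square": square}


def server_stub(request_bytes: bytes) -> bytes:
    request = json.loads(request_bytes.decode("utf-8"))       # unpack parameters
    result = PROCEDURES[request["proc"]](*request["args"])    # call the server
    return json.dumps({"result": result}).encode("utf-8")     # pack the result


def transport(request_bytes: bytes) -> bytes:
    """Stand-in for the client's OS, the network and the remote OS."""
    return server_stub(request_bytes)


def square_client_stub(x):
    """Looks like a local procedure to the caller."""
    request = json.dumps({"proc": "square", "args": [x]}).encode("utf-8")  # build message
    reply = transport(request)                                 # hand it to the "OS"
    return json.loads(reply.decode("utf-8"))["result"]         # unpack the result
```

The client simply calls `square_client_stub(11)` in the normal way; the message building, transfer and unpacking are hidden inside the stubs.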
22. Client and Server Stubs
- Principle of RPC between a client and server program.
23. Example (Sun RPC - ONC)
- long square(long) example
  - client ren.eecis.udel.edu 11
  - result 121
- Need RPC specification file (square.x): defines procedure name, arguments and results
- Run rpcgen square.x: generates square.h, square_clnt.c, square_xdr.c, square_svc.c
  - square_clnt.c, square_svc.c: stub routines for client and server
  - square_xdr.c: XDR (External Data Representation) code, takes care of data type conversions
24. RPC Specification File (square.x)
    struct square_in {
        long arg1;
    };
    struct square_out {
        long res1;
    };
    program SQUARE_PROG {
        version SQUARE_VERS {
            square_out SQUAREPROC(square_in) = 1;   /* procedure */
        } = 1;                                      /* version */
    } = 0x321230000;                                /* program */
IDL: Interface Definition Language
25. Parameter Specification and Stub Generation
procedure
Corresponding message
26. Writing a Client and a Server
- The steps in writing a client and a server in DCE RPC.
27. Binding (SUN RPC)
- Port Mapper (rpcbind) listens at UDP port 111
- Server registers program ID and version
  - rpcinfo -p -> display all registered RPC servers
- When client issues clnt_create, the port mapper is contacted
  - program-to-port number mapping
  - arguments (program ID, version, protocol)
  - response: server's port number
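The program-to-port mapping can be pictured as a small lookup table. The sketch below is a toy in-memory model, not the rpcbind protocol itself; the class name, the program number 0x20000001 and the port 40001 are hypothetical values chosen for the example. Returning 0 for an unregistered program mirrors the port mapper's convention.

```python
IPPROTO_TCP = 6     # protocol numbers as used in port mapper requests
IPPROTO_UDP = 17


class PortMapperModel:
    """Toy model of the port mapper's registration table."""

    def __init__(self):
        self.table = {}   # (program, version, protocol) -> port

    def register(self, program, version, protocol, port):
        """What a server does at start-up."""
        self.table[(program, version, protocol)] = port

    def getport(self, program, version, protocol):
        """What happens on the client's behalf during clnt_create."""
        return self.table.get((program, version, protocol), 0)  # 0: not registered


mapper = PortMapperModel()
mapper.register(0x20000001, 1, IPPROTO_UDP, 40001)   # hypothetical server
```

Because the server's port is looked up at bind time, only the port mapper itself needs a well-known port.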
28. Binding (DCE)
29. Passing Value Parameters (I)
30. Passing Value Parameters (II)
- a. Original message on Pentium (little-endian)
- b. The message after receipt on SPARC (big-endian)
- c. The message after being inverted.
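The three cases can be reproduced with a few lines of code. The message contents used here (the integer 5 followed by the 4-character string "JILL") are an assumption made for the example, as are the helper names.

```python
import struct


def as_seen_by_big_endian(value: int) -> int:
    """Case b: a little-endian 32-bit integer read by a big-endian host."""
    wire = struct.pack("<I", value)          # bytes as laid out on the Pentium
    return struct.unpack(">I", wire)[0]      # interpreted MSB-first on the SPARC


def invert_words(message: bytes) -> bytes:
    """Case c: reverse the byte order of every 4-byte word in the message."""
    return b"".join(message[i:i + 4][::-1] for i in range(0, len(message), 4))


original = struct.pack("<I", 5) + b"JILL"    # case a
inverted = invert_words(original)
```

Blindly inverting every word repairs the integer but scrambles the string, which is why the receiver needs to know the type of each field rather than just the byte order of the sender.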
31. Passing Value Parameters (III)
- How to pass pointers ?
  - Meaningful only within a specific address space !
- Arrays (of known length) and structures
  - Copy/restore semantics (between stubs)
  - IN/OUT/INOUT markers
    - Optimization may eliminate one copy operation
- Pointer to an arbitrary data structure ?
  - No general solution
  - Work-around
    - Pass back the pointer to its source
32. External Data Representation (I)
- Data structures
  - flattened on transmission
  - rebuilt upon reception
- Primitive data types
  - byte order (big-endian: MSB comes first)
  - ASCII vs UNICODE (2 bytes per character)
- marshalling/unmarshalling
  - to/from agreed external format
33. External Data Representation (II)
- XDR (RFC 1832), CDR (CORBA), Java
  - data -> byte stream
  - object references
- HTTP/MIME
  - data -> ASCII text
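As an illustration of marshalling to a byte stream, the sketch below encodes two XDR types by hand: a 32-bit integer (4 bytes, big-endian) and a string (4-byte length, the bytes, then zero padding to a multiple of 4). The function names are made up for the example, and only ASCII strings are handled.

```python
import struct


def pack_int(value: int) -> bytes:
    """XDR signed integer: 4 bytes, most significant byte first."""
    return struct.pack(">i", value)


def pack_string(text: str) -> bytes:
    """XDR string: length word, data, zero padding up to a 4-byte boundary."""
    data = text.encode("ascii")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def unpack_int(buffer: bytes, offset: int = 0):
    """Return (value, offset of the next item)."""
    return struct.unpack_from(">i", buffer, offset)[0], offset + 4


def unpack_string(buffer: bytes, offset: int = 0):
    """Return (text, offset of the next item)."""
    (length,) = struct.unpack_from(">I", buffer, offset)
    start = offset + 4
    text = buffer[start:start + length].decode("ascii")
    padded = (length + 3) // 4 * 4           # skip the padding as well
    return text, start + padded
```

A structure is flattened by packing its fields one after another and rebuilt by unpacking them in the same order, threading the offset through each call.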
34. CORBA CDR example
35. Properties of TCP
- Connected vs Connectionless Protocols
- TCP is a stream protocol
- Performance of TCP
- Avoid re-inventing TCP !!
- TCP failure modes
- Behaviour of peers
- LAN vs WAN testing
- Tools and Resources
36. Basic socket calls
SERVER
CLIENT
37. Performance of TCP (I)
- 4.4BSD Implementation
  - UDP: 800 LOC
  - TCP: 4,500 LOC
- CPU processing: checksums, data copying
- TCP ACKs
  - Receiver can piggyback the ACK
  - Usually every second segment is ACKed
  - ... May even delay ACKs (up to 0.5 sec)
- Connection setup: 3 segments
  - 1 ½ RTT: SYN, SYN+ACK, ACK
- Connection tear-down: 4 segments
  - FIN, ACK, FIN (server-to-client), ACK
  - Except the last segment, these can be combined with data-bearing segments
38. Performance of TCP (II)
- Results from a benchmark involving transmission of 5,000 data blocks
  - UDP datagram size = TCP write size = 1,440 bytes
    - Ethernet frame: 1,500 bytes
    - IP header: 20 bytes, TCP header: 20 bytes
    - TCP options: 12 bytes
  - Average over 50 runs
- Client produces data blocks, transmits them, and then exits
- Server may run on
  - localhost (127.0.0.1)
  - Same host as the client, but given as an address
  - Other host
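The header sizes quoted for the benchmark explain why a 1,440-byte block is a convenient size: it fits in one TCP segment on an interface that carries 1,500 bytes per frame. The short calculation below works this out; the function names are made up for the example, and the 8-byte UDP header is a standard value that the slide does not list.

```python
MTU = 1500            # bytes the interface carries per frame
IP_HEADER = 20
TCP_HEADER = 20
TCP_OPTIONS = 12
UDP_HEADER = 8        # standard UDP header size (not listed on the slide)


def max_tcp_payload(mtu: int = MTU) -> int:
    """Largest amount of application data in one TCP segment."""
    return mtu - IP_HEADER - TCP_HEADER - TCP_OPTIONS


def max_udp_payload(mtu: int = MTU) -> int:
    """Largest UDP datagram payload that avoids IP fragmentation."""
    return mtu - IP_HEADER - UDP_HEADER


def segments_needed(write_size: int, mtu: int = MTU) -> int:
    """Minimum number of TCP segments for one write (ceiling division)."""
    return -(-write_size // max_tcp_payload(mtu))
```

With these numbers a TCP segment carries up to 1,448 bytes of data, so a 1,440-byte write needs one segment, and the same block sent over UDP stays below the fragmentation limit.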
39. Performance of TCP (III)
Localhost (loop-back): MTU = 16,384
Client (network I/f): MTU = 1,500
40. Performance of TCP (IV)
Results for write size = 300 bytes
41. Avoid re-inventing TCP !!
- Retransmissions ?
  - RTO
    - Must be adjustable
    - Exponential back-off
- Flow control
  - Sliding window
- Congestion control
- Matching replies to requests ?
  - Sequence number for each request
- Efficiency of the implementation ?
  - TCP code base is highly optimized
  - and runs in kernel-space
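To give a feel for what an application built on UDP would have to re-implement, here is a sketch of just the retransmission part with exponential back-off. The function names, the default values and the two callables `send` and `wait_for_reply` are assumptions for the example; an adjustable RTO based on measured round-trip times, flow control and congestion control are all still missing.

```python
def backoff_schedule(initial: float, maximum: float, attempts: int):
    """Timeouts for successive attempts: doubled each time, capped at `maximum`."""
    timeouts = []
    rto = initial
    for _ in range(attempts):
        timeouts.append(min(rto, maximum))
        rto *= 2
    return timeouts


def call_with_retries(send, wait_for_reply,
                      initial_rto=1.0, max_rto=64.0, attempts=5):
    """Send a request and retransmit it until a reply arrives.

    `send()` transmits the request; `wait_for_reply(timeout)` returns the
    reply, or None if the timeout expired first.
    """
    for timeout in backoff_schedule(initial_rto, max_rto, attempts):
        send()
        reply = wait_for_reply(timeout)
        if reply is not None:
            return reply
    raise TimeoutError("no reply after %d attempts" % attempts)
```

Even this small piece raises the questions from the slide: a retransmitted request may be executed twice unless replies are matched to requests by sequence number.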
42. LAN vs WAN testing
- Performance on the WAN may not be satisfactory, due to the extra latency
  - may have to reconsider the design
- Incorrect code is more likely to be triggered on the WAN
  - assumptions on volume/rate of arriving data
43. HTTP
- Methods
  - GET, HEAD, POST
  - PUT, DELETE, TRACE, OPTIONS
- Resource: MIME-encoded data
- Content negotiation
- Authentication
44. Tools (I)
- ping
  - IP header + ICMP echo request/reply
- tcpdump
  - Network analyzer / sniffer
- traceroute
  - Determine the network path by forcing each intermediate router to send an ICMP error message to the originator
    - Send a UDP datagram with TTL=1, so that the 1st router in the path will discard it !
    - Send a 2nd UDP datagram with TTL=2, so that the 2nd router in the path will discard it !
    - ...
    - At the last hop, TTL=1, and an attempt is made to deliver the datagram (generating the ICMP error message "port unreachable")
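The sending half of the traceroute technique can be sketched with an ordinary UDP socket whose TTL is set explicitly. The helper names are hypothetical, and 33434 is the destination port traditionally used for the first probe. Receiving the ICMP error messages is not shown, since it normally needs a raw socket and elevated privileges.

```python
import socket


def make_probe_socket(ttl: int) -> socket.socket:
    """UDP socket whose outgoing datagrams carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
    return sock


def send_probe(host: str, ttl: int, port: int = 33434) -> None:
    """Send one empty probe; the router `ttl` hops away should discard it."""
    with make_probe_socket(ttl) as sock:
        sock.sendto(b"", (host, port))
```

A full implementation would call `send_probe` with TTL 1, 2, 3, ... and record the source address of each ICMP error until "port unreachable" arrives from the destination itself.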
45. Tools (II)
- ttcp
  - Benchmarking tool, with many parameters
  - UDP or TCP transfers, buffers, size of read/writes
- lsof
  - Determine which process has a file descriptor open (file or socket)
  - lsof -i TCP:6000
  - lsof -i @remotehost.xdomain.net
- netstat
  - Active sockets: netstat -af inet
  - Interfaces: netstat -i
  - Routing table: netstat -rn
  - Protocol statistics: netstat -sp tcp
- System call tracers: strace, truss, ktrace
46. Resources
- Books
  - Richard Stevens
    - TCP/IP Illustrated series
      - Protocols, Implementation, T/TCP/HTTP/NNTP/Domain Sockets
    - UNIX Network Programming series
      - Networking APIs: Sockets, XTI
      - Interprocess Communication
  - J.C. Snader: Effective TCP/IP Programming
- RFCs: http://www.rfc-editor.org