Title: Internetworking Protocols and Programming
CSE 5348 / 7348, Instructor: Krish Pillai, Session 8
TCP Protocol
- Uses a sliding window protocol to optimize network throughput
- Provides acknowledgements for guaranteed data delivery
- Provides full duplex communication through the use of sequence and acknowledgement numbering in each direction
- Retransmits a segment if its ACK is not received within a certain time limit
- The retransmit time is based on RTT under normal conditions, or on Karn's back-off strategy once a retransmit has occurred
- The receiver uses advertised windows to keep the sender from flooding its buffer space with data
Network Delay
- The timeout value based on the RTT computation is not responsive to short-term variations in network delay
- Intermediate routers have to queue incoming datagrams while they are busy consulting routing tables
- This causes queue lengths to increase, making certain datagrams spend extra time in buffers waiting to be forwarded
- This results in jitter, or changing round trip time, which is a function of network congestion
- Queuing theory shows that the variance $\sigma$ in round trip time is inversely proportional to available network capacity
Network Delay
- Let $L$ be the current network load expressed as a fraction such that $0 \le L \le 1$
- The variance in RTT, $\sigma$, is proportional to $1/(1-L)$
- If the network is running at 80% capacity, then $1/(1-0.8) = 5$, so the variance is five times what it is on an unloaded network
- The previous technique of setting $\text{Timeout} = \beta \times \text{RTT}$ works well for loads up to 30%
  - where $\text{RTT} = (\alpha \times \text{Old\_RTT}) + ((1-\alpha) \times \text{New\_RTT})$
- This equation did not have a predictive (diff) component in it
- The new computation for the timeout value factors in the load differential for better response
Network Delay
- Compute the difference between successive RTT measurements
  - $\text{diff} = \text{SAMPLE} - \text{Old\_RTT}$
  - $\text{Smoothed\_RTT} = \text{Old\_RTT} + \delta \times \text{diff}$
  - $\text{DEV} = \text{Old\_DEV} + \rho \times (|\text{diff}| - \text{Old\_DEV})$
  - $\text{Timeout} = \text{Smoothed\_RTT} + \eta \times \text{DEV}$
- Where
  - $0 \le \delta \le 1$ controls how quickly the new sample affects the weighted average
  - $0 \le \rho \le 1$ controls how quickly the new sample affects the mean deviation
  - $\eta \ge 1$ controls how strongly the deviation affects the timeout value
- The original value of $\eta$ was 2, but it was later changed to 4 in 4.4 BSD
- It was found that this estimate responded well but tended to underestimate RTT, causing retransmission
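The four update equations above can be sketched as a small C routine. This is only an illustration under stated assumptions: the struct and function names are hypothetical, times are in milliseconds, and the weights used in the usage note ($\delta = 1/8$, $\rho = 1/4$) are common choices rather than values given in these slides.

```c
/* Hypothetical estimator state; all names are illustrative only. */
struct rtt_estimator {
    double smoothed_rtt; /* Smoothed_RTT, in milliseconds */
    double dev;          /* DEV, the smoothed mean deviation */
    double delta;        /* weight of a new sample in the average, 0..1 */
    double rho;          /* weight of a new sample in the deviation, 0..1 */
    double eta;          /* multiplier applied to DEV in the timeout */
};

/* Fold one RTT sample into the estimator and return the new timeout. */
double rtt_update(struct rtt_estimator *e, double sample)
{
    double diff = sample - e->smoothed_rtt;        /* diff = SAMPLE - Old_RTT */
    double abs_diff = diff < 0 ? -diff : diff;

    e->smoothed_rtt += e->delta * diff;            /* Smoothed_RTT update */
    e->dev += e->rho * (abs_diff - e->dev);        /* DEV update */
    return e->smoothed_rtt + e->eta * e->dev;      /* Timeout */
}
```

For example, starting from Smoothed_RTT = 100, DEV = 10, with $\delta = 0.125$, $\rho = 0.25$, $\eta = 4$, a sample of 140 gives Smoothed_RTT = 105, DEV = 17.5, and a timeout of 175.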
Congestion Response
- Delay caused by routers under heavy load can therefore cause TCP stacks to send unnecessary retransmissions, triggering a congestion collapse
- TCP should reduce transmission rates when congestion occurs on routers
- Unlike UDP, TCP provides congestion control
- Congestion algorithms prevent senders from overloading the network
- Most implementations support the following four basic algorithms, which are IETF standards
  - Slow Start
  - Congestion Avoidance
  - Fast Retransmit
  - Fast Recovery
Slow Start Algorithm
- Routers send ICMP Source Quench if buffers fill up, but this works only up to a point
- If TCP starts injecting packets based on the advertised window, a rapid rise in traffic can cause congestion and subsequent retransmissions
- The Slow Start Algorithm adds a new window called cwnd, the Congestion Window
- TCP acts as if the sliding window size is
  - min(advertised window, congestion window)
- When a connection is established, cwnd is initialized to one segment
- The Slow Start Algorithm is based on the observation that
  - new packets should be injected into the network at the rate at which acknowledgements are received
Slow Start Algorithm
- Increase the congestion window by one segment each time an acknowledgement is received
- Starts with one segment being sent with cwnd set to 1
- When the ACK arrives, two segments are sent and cwnd is set to two
- When two ACKs arrive, cwnd is set to four, and so on
- The congestion window is allowed to grow until it becomes equal to the advertised window
- For a receiver advertised window size of $N$ segments, it takes only $\log_2 N$ round trips before cwnd reaches the advertised size
- This mechanism for injecting packets works only if there are no losses in the network
- Losses are detected by timeouts or duplicate ACKs
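The $\log_2 N$ claim can be checked with a short loop. The sketch below assumes that every segment in a window is acknowledged within one round trip and that nothing is lost, so cwnd doubles per round trip; the function name is hypothetical.

```c
/* Count the round trips needed for cwnd to grow from one segment to at
 * least `advertised` segments, assuming cwnd doubles every round trip
 * and no segment is lost. */
unsigned slow_start_round_trips(unsigned advertised)
{
    unsigned long long cwnd = 1;   /* wide type so doubling cannot wrap */
    unsigned trips = 0;

    while (cwnd < advertised) {
        cwnd *= 2;                 /* one ACK per segment doubles cwnd */
        trips++;
    }
    return trips;
}
```

With an advertised window of 64 segments the loop runs 6 times, matching $\log_2 64 = 6$.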
Congestion Avoidance
- To control the rate at which packets are injected in the event of loss, TCP uses a different rate to inject packets
- TCP enters a congestion avoidance phase when loss is detected, in which cwnd is increased by one segment only when all segments in the window have been acknowledged
- To detect the crossover point between slow start and congestion avoidance, TCP uses a register called ssthresh
- When segment loss is detected by TCP, half the cwnd value is copied into the ssthresh register
- If duplicate ACKs are detected, cwnd is multiplicatively reduced
- If a timeout was the reason, cwnd is set to one segment
- When the receiver starts acknowledging segments, TCP uses slow start or congestion avoidance to grow cwnd
- If $\text{cwnd} \le \text{ssthresh}$ then TCP is in slow start, else in congestion avoidance
TCP Congestion Control Algorithms
- The combined algorithm works as follows
  - A new connection sets cwnd to one segment and ssthresh to 65535 bytes
  - cwnd is grown according to the slow start algorithm as ACKs are received
  - TCP never sends more than the lower of cwnd and the receiver's advertised window, which is supplied in the ACK packet
  - When packet loss is detected (timeout or duplicate ACKs), one half of cwnd is stored as the ssthresh value; additionally, if congestion was indicated by a timeout (prolonged congestion), cwnd is set to one segment
  - When new data is acknowledged by the other end, cwnd is increased
  - The way cwnd is increased depends on whether TCP is in slow start or congestion avoidance mode
  - If cwnd is less than or equal to ssthresh, TCP is in slow start; otherwise TCP is in congestion avoidance mode
  - Slow start increases cwnd exponentially, while congestion avoidance increases cwnd by $\text{segsize} \times \text{segsize} / \text{cwnd}$ per ACK (linear growth)
  - Exponential window growth occurs until TCP is halfway up to where congestion occurred; then window growth is linear (less aggressive)
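The combined algorithm can be sketched as a small state machine. This is a simplified illustration that follows the rules exactly as stated on the slide (for instance, ssthresh is set to half of cwnd with no lower bound); the struct and function names are hypothetical, and all sizes are in bytes.

```c
/* Hypothetical per-connection congestion state, sizes in bytes. */
struct tcp_cc {
    unsigned long cwnd;      /* congestion window */
    unsigned long ssthresh;  /* slow start threshold */
    unsigned long segsize;   /* segment size */
};

/* New connection: cwnd is one segment, ssthresh is 65535 bytes. */
void cc_init(struct tcp_cc *c, unsigned long segsize)
{
    c->segsize = segsize;
    c->cwnd = segsize;
    c->ssthresh = 65535;
}

/* Called each time new data is acknowledged by the other end. */
void cc_on_ack(struct tcp_cc *c)
{
    if (c->cwnd <= c->ssthresh)
        c->cwnd += c->segsize;                        /* slow start */
    else
        c->cwnd += c->segsize * c->segsize / c->cwnd; /* congestion avoidance */
}

/* Called when loss is detected; `timeout` is nonzero when the loss was
 * signalled by a retransmission timeout rather than duplicate ACKs. */
void cc_on_loss(struct tcp_cc *c, int timeout)
{
    c->ssthresh = c->cwnd / 2;
    if (timeout)
        c->cwnd = c->segsize;
}

/* TCP never sends more than the lower of cwnd and the advertised window. */
unsigned long cc_send_window(const struct tcp_cc *c, unsigned long advertised)
{
    return c->cwnd < advertised ? c->cwnd : advertised;
}
```

With 1000-byte segments, a timeout at cwnd = 8000 sets ssthresh to 4000 and cwnd to 1000. Four ACKs then take cwnd to 5000 in slow start, and the next ACK adds only 1000 × 1000 / 5000 = 200 bytes.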
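TCP Congestion Control Algorithms
[Figure: TCP window growth as a function of time. Axes: cwnd size versus time. Labels: cwnd doubling in the slow start region up to cwnd = ssthresh, then $\Delta\text{cwnd} = \text{segsize} \times \text{segsize} / \text{cwnd}$ up to cwnd = advertised window.]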
Fast Retransmit
- The Fast Retransmit algorithm (1990) avoids TCP waiting for a timeout to resend lost segments
- The TCP sender does not know whether duplicate ACKs are due to packets being delivered out of sequence or to a packet actually being lost
- The sender waits for a small number of duplicate ACKs to be received
- Assumption: if the packets were delivered out of order, there will be only one or two duplicate ACKs before a new ACK is sent by the receiver
- If three or more duplicate ACKs are received in succession, it is strongly indicative of packet loss
- The sender then retransmits the (apparently) missing segment without waiting for a timeout to occur
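The duplicate ACK rule above amounts to a counter on the sender. The sketch below is a minimal illustration with hypothetical names; it ignores sequence number wraparound and every other part of ACK processing.

```c
#define DUP_ACK_THRESHOLD 3

/* Hypothetical sender-side tracker for duplicate ACKs. */
struct ack_tracker {
    unsigned long last_ack; /* most recent ACK number seen */
    int dup_count;          /* consecutive duplicates of last_ack */
};

/* Process one incoming ACK number. Returns 1 exactly when the third
 * duplicate arrives, meaning the segment starting at last_ack should be
 * retransmitted now without waiting for a timeout; returns 0 otherwise. */
int track_ack(struct ack_tracker *t, unsigned long ack)
{
    if (ack == t->last_ack) {
        t->dup_count++;
        return t->dup_count == DUP_ACK_THRESHOLD;
    }
    t->last_ack = ack;      /* new data acknowledged: reset the count */
    t->dup_count = 0;
    return 0;
}
```

One or two duplicates (likely reordering) return 0; only the third triggers the retransmission, and later duplicates of the same ACK do not trigger it again.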
Fast Recovery
- Network congestion is often transitory, and slow start can force TCP to lose whatever it has learnt about the network
- Therefore, once Fast Retransmit starts, cwnd is grown based on congestion avoidance (linear growth) and not slow start (exponential growth)
- This approach is termed the Fast Recovery Algorithm
- Assumption: receipt of a few duplicate ACKs means that packets are getting through, though some packets were dropped
- The receiver can generate a duplicate ACK only when another packet has been received and buffered
- This indicates that congestion is transient; hence there is no need to reduce the flow abruptly by going into slow start (which sets cwnd to one segment)
- After retransmission occurs and all segments in the current window are acknowledged, the cwnd size is increased linearly
Congestion and Tail Drop
- Routers handling heavy TCP traffic need specialized mechanisms for recovering from congestion
- In its simplest form, routers handle congestion by dropping new packets that arrive at their ingress buffer (Tail-Drop policy)
- Tail-Drop can trigger a global response on TCP flows that pass through a specific router
- When a router that supports several TCP flows drops packets, all supported flows are affected
- All affected senders reset their cwnd to 1 segment, causing an abrupt drop in traffic and allowing the router to recover
- All senders may go into slow start simultaneously
- They may start growing traffic together and drive the router back into congestion, causing the network throughput to oscillate
Random Early Discard
- Routers use an improved scheme for overload control
- Routers set two markers for the input buffer pool, at Tmax and Tmin
- If the queue contains fewer than Tmin datagrams, add all arriving datagrams to the queue
- If the queue contains more than Tmin datagrams but fewer than Tmax, randomly discard packets with a probability of p
- If the queue contains Tmax datagrams, discard all arriving packets
- The probability p for discarding packets can be increased or decreased based on the nature of congestion
- If congestion is so high that Tmax is consistently maintained, RED degenerates to Tail-Drop, causing global oscillations
- A simple approach is to increase p from 10% to 100% in increments of 10% if sustained congestion occurs
Random Early Discard
- If a short burst of datagrams pushes the indicator above Tmin, RED starts dropping packets randomly
- The queue may never get filled under such circumstances
- To avoid this, RED computes a weighted average queue size
  - $\text{avg} = (1 - \gamma) \times \text{Old\_avg} + \gamma \times \text{Current\_queue\_size}$
  - where $\gamma$ is a coefficient between 0 and 1
- The queue is generally measured in octets and not datagrams, but discarding is done based on datagrams
- This means small datagrams have a lower probability of being discarded than large datagrams
- This makes sure that pure ACKs have a lower probability of getting dropped under congested conditions
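The RED decision for one arriving datagram can be sketched as follows. This is a simplified illustration, not a router implementation: it assumes the thresholds are compared against the weighted average, that p is a fixed probability, and that the caller supplies a uniform random number in [0, 1) so the decision is reproducible. All names are hypothetical.

```c
/* Hypothetical RED state for one queue. */
struct red_queue {
    double avg;    /* weighted average queue size */
    double gamma;  /* weight of the current queue size, between 0 and 1 */
    double t_min;  /* Tmin threshold */
    double t_max;  /* Tmax threshold */
    double p;      /* discard probability between Tmin and Tmax, 0..1 */
};

enum red_action { RED_ENQUEUE, RED_DROP };

/* Decide what to do with one arriving datagram. `current` is the
 * instantaneous queue size; `u` is a uniform random number in [0, 1). */
enum red_action red_arrival(struct red_queue *q, double current, double u)
{
    /* avg = (1 - gamma) * Old_avg + gamma * Current_queue_size */
    q->avg = (1.0 - q->gamma) * q->avg + q->gamma * current;

    if (q->avg < q->t_min)
        return RED_ENQUEUE;                 /* below Tmin: always queue */
    if (q->avg >= q->t_max)
        return RED_DROP;                    /* at or above Tmax: tail drop */
    return u < q->p ? RED_DROP : RED_ENQUEUE; /* in between: drop with prob. p */
}
```

Because the average moves only a fraction $\gamma$ of the way toward the current queue size on each arrival, a short burst does not immediately push it past Tmin.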
Establishing a Connection
- TCP is connection oriented, requiring establishment of a connection before processes can talk
- The server (usually) issues a passive OPEN call
- Clients issue an active OPEN call
- The passive OPEN call remains dormant until a process attempts to connect to it by an active OPEN

The three-way handshake between Process 1 and Process 2 (ISN = Initial Sequence Number)
- Passive side: passive OPEN, waits for an active request
- Active side: active OPEN
  1. Active side sends SYN, seq = n (ISN); passive side receives SYN
  2. Passive side sends SYN, seq = m (ISN), ACK = n + 1; active side receives SYN + ACK
  3. Active side sends ACK = m + 1
Closing a Connection
- TCP is full duplex; therefore release signals should be sent to both ends of the connection
- One end sends its last TCP segment with the FIN bit set
- The other process sends all its data, ending with a TCP segment with the FIN bit set
- The FIN bit signals the termination of a connection in one direction
- FIN signals have to be received at both ends before the connection is released
- MSL is the Maximum Segment Lifetime; the connection stays in the TIME_WAIT state for 2 MSL after an active close

Connection release between Process 1 and Process 2
  1. Active close side sends FIN, seq = 1415531522(0), ACK = 1823083522
  2. Passive close side sends ACK = 1415531523
  3. Passive close side sends FIN, seq = 1823083522(0), ACK = 1415531523
  4. Active close side sends ACK = 1823083523, then stays in TIME_WAIT for 2 MSL
Closing a Connection
- TCP takes three segments to establish a connection
- It takes four segments to terminate a connection (orderly release)
- TCP does a half-close in either direction before a connection is terminated
- Either end may send a FIN signal
- When TCP receives a FIN, the stack notifies the application that the other end has terminated the connection
- TCP provides this seldom-used half-close feature so that an application can close transmission one way while continuing to receive data from the far end
- Connections both ways can also be terminated by an ABORT signal (abortive release) with the RST bit set in the code field
- Processes can do a simultaneous active open/close (peer to peer)
Silly Window Syndrome
- TCP throughput deteriorates when one of the machines involved in the transaction is extremely slow
- As the buffer on the receiver fills up, it will progressively advertise a smaller window
- Transmission from the sender will stop when the advertisement drops to zero
- Subsequent transmissions may have segments carrying one byte at a time, degrading network throughput
- This can also happen if the application sends data in blocks of B octets and TCP transmits in segments of M octets
  - if M is not a multiple of B, fractional data fragments have to be transmitted in small segments
- This problem is termed the Silly Window Syndrome (SWS)
SWS Avoidance
- Avoidance of SWS can be done at the receiver and at the sender
- Receive-side silly window avoidance
  - The receiver advertises a zero window when its buffer fills up
  - The receiver is then made to delay window advertisements until the buffer empties substantially
  - Window advertisements start when the buffer is 50% empty or when enough buffer space to hold a segment of size MSS (maximum segment size) is available
- Receive-side delayed acknowledgements
  - Once the advertised window becomes small, the receiver can start to delay acknowledgements
  - If data arrives in the meantime, a single acknowledgement can signal receipt of all the datagrams, reducing reverse traffic
  - If ACKs are delayed too much, a retransmission may occur
  - RTT computation on the sender may go awry due to the artificial delay from the receiver (ACKs should never be delayed more than 500 ms)
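The receive-side window rule above (advertise only when the buffer is at least half empty or has room for one MSS) can be sketched as a single function. The name and parameters are hypothetical, and the sketch assumes `buf_used` never exceeds `buf_size`.

```c
/* Return the window to advertise, in octets. The receiver keeps
 * advertising zero until the free space reaches half the buffer or one
 * maximum-sized segment, whichever comes first.
 * Assumes buf_used <= buf_size. */
unsigned long advertised_window(unsigned long buf_size,
                                unsigned long buf_used,
                                unsigned long mss)
{
    unsigned long free_space = buf_size - buf_used;

    if (free_space >= buf_size / 2 || free_space >= mss)
        return free_space;   /* worth advertising */
    return 0;                /* too small: keep the window closed */
}
```

With an 8192-octet buffer and an MSS of 1460, 192 free octets still advertise zero, while 2192 free octets are advertised in full.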
SWS Avoidance
- Send-side silly window avoidance: the Nagle Algorithm
  - Data is clumped into aggregates before it is sent so that tinygrams are avoided
  - An adaptive technique is used by TCP to send data accumulating in the transmit buffer
  - While previously sent data is unacknowledged, new data is queued in a transmit buffer until a limit (a maximum-sized segment) is reached
  - Data is transmitted when the limit is reached
  - If the buffer is still not filled, the data is transmitted when an ACK arrives
  - The rule applies even when the PUSH flag is set
- Certain highly interactive applications, such as X Window traffic, require prompt transmission of mouse and cursor controls
- The Nagle Algorithm can be turned off to improve response time using the TCP_NODELAY socket option
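Setting TCP_NODELAY is done with `setsockopt` at the `IPPROTO_TCP` level. The helper below is a minimal sketch; its name is hypothetical, and `sockfd` is assumed to be a TCP socket obtained from `socket`.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Turn the Nagle algorithm off (disable != 0) or back on (disable == 0)
 * for a TCP socket. Returns 0 on success, -1 on error with errno set. */
int set_nagle_disabled(int sockfd, int disable)
{
    int flag = disable ? 1 : 0;

    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag);
}
```

An interactive client would typically call `set_nagle_disabled(fd, 1)` once, right after creating or connecting the socket.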
Application Program Interfaces
- The two most prevalent APIs are Berkeley Sockets and the System V Transport Layer Interface (TLI)
- Both were developed for UNIX and first implemented in the C language
- The API design approach is to make network I/O as similar to file I/O as possible
- File I/O supports the following six system calls
  - open, creat, close, read, write, and lseek
- File I/O operations work with file descriptors
- A file descriptor is an integer, unique within a process, that identifies a file that has been opened for I/O
- Though superficially similar, network I/O requires more details and options than file I/O
Application Program Interfaces
- Need to specify whether the protocol between client and server is connection-oriented or connectionless
- Only two processes take part in a transaction based on a connection-oriented protocol
  - A dedicated connection is set up for each transaction between the server process and the client process
- Multiple client processes can talk to a server process using a connectionless protocol
  - Several client processes can simply send messages to the same server process
- For an efficient client-server application, we have to maximize transaction throughput
- Servers should process as many client requests as possible within a specific time slot
Server Client Model
- Iterative server: the server waits for a client request, services it when it arrives, and goes back to waiting for a new request
  - Suitable when the server process knows how long it takes to service a client request
  - Requests arriving while the server is busy are queued for the server by the kernel
- Concurrent server: the server starts a new process to handle each request
  - Suitable when the server process does not know how long it takes to handle a request
- Client and server processes are asymmetric and are coded differently
- Servers are generally started first, and clients connect to them later on
Server Client Model
- Servers
  1. Open a communication channel and inform the local host of its readiness to accept client requests on a well-known address (WKA)
  2. Wait for a client request to arrive at the WKA
  3. For an iterative server, process the request and send a reply. For a concurrent server, spawn a new process to handle this client request
  4. Go back to step 2 and wait for another client request
- Clients
  1. Open a communication channel and connect to a specific well-known address on a specific host
  2. Send service request messages to the server, and receive the responses
  3. Close the communication channel and terminate
Client-Server Model (Connection-oriented protocol)
[Figure: Server and client timelines. The server blocks until a connection arrives from the client; after connection establishment, the client sends data (request) and the server returns data (reply).]
Client-Server Model (Connectionless protocol)
[Figure: Server and client timelines. The server blocks until data is received from the client; the client sends data (request), the server processes the request and returns data (reply).]
Socket Addresses
- Most BSD networking system calls require a pointer to a socket address structure as an argument (defined in <sys/socket.h>)

    struct sockaddr {
        u_short sa_family;    /* address family: AF_xxx value */
        char    sa_data[14];  /* up to 14 bytes of protocol-specific address */
    };

- The struct sockaddr is a generic construct that can hold identifiers for any protocol
- For the Internet family the following structures are defined in <netinet/in.h>

    struct in_addr {
        u_long s_addr;               /* 32 bit network ID / host ID */
    };

    struct sockaddr_in {
        short          sin_family;   /* AF_INET */
        u_short        sin_port;     /* 16 bit port number, network byte ordered */
        struct in_addr sin_addr;     /* 32 bit netid/hostid, network byte ordered */
        char           sin_zero[8];  /* unused */
    };
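Filling in a `struct sockaddr_in` for an IPv4 address and port might look like the sketch below. The helper name is hypothetical, and it uses the POSIX `inet_pton` call, which is newer than the interfaces on these slides, to parse the dotted-decimal address.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill `addr` with an IPv4 address given in dotted-decimal form and a
 * port in host byte order. Returns 0 on success, -1 if `ip` is not a
 * valid dotted-decimal address. */
int make_inet_addr(struct sockaddr_in *addr, const char *ip,
                   unsigned short port)
{
    memset(addr, 0, sizeof *addr);       /* also clears sin_zero */
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);        /* convert to network byte order */
    return inet_pton(AF_INET, ip, &addr->sin_addr) == 1 ? 0 : -1;
}
```

The filled structure is passed to calls such as `bind` or `connect` after casting its address to `struct sockaddr *`.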
Elementary Socket System Calls
- socket system call: returns a descriptor of type integer
- This call specifies the type of communication protocol desired (TCP, UDP, XNS (Xerox Network Systems), etc.)

    #include <sys/types.h>
    #include <sys/socket.h>
    int socket(int family, int type, int protocol);

- where family is one of
  - AF_UNIX: Unix internal protocols
  - AF_INET: Internet protocols
  - AF_NS: Xerox NS protocols
  - AF_IMPLINK: Interface Message Processor link layer
Elementary Socket System Calls
- socket system call: returns a descriptor of type integer

    #include <sys/types.h>
    #include <sys/socket.h>
    int socket(int family, int type, int protocol);

- where type is one of
  - SOCK_STREAM: stream socket
  - SOCK_DGRAM: datagram socket
  - SOCK_RAW: raw socket
  - SOCK_SEQPACKET: sequenced packet socket
  - SOCK_RDM: reliably delivered message socket (unimplemented)
Elementary Socket System Calls
- socket system call: returns a descriptor of type integer

    #include <sys/types.h>
    #include <sys/socket.h>
    int socket(int family, int type, int protocol);

- where protocol is one of
  - IPPROTO_UDP: UDP
  - IPPROTO_TCP: TCP
  - IPPROTO_ICMP: ICMP
  - IPPROTO_RAW: protocol field left empty
Elementary Socket System Calls
- bind system call: assigns a name to an unnamed socket

    #include <sys/types.h>
    #include <sys/socket.h>
    int bind(int sockfd, struct sockaddr *myaddr, int addrlen);

- Servers register their well-known addresses with the system so that the system can forward packets bound for this IP address and port number to the bound process
- A client can register a specific address for itself
- A connectionless client needs to ensure that the system assigns it some unique address, so that the other end has a valid return address to send its responses to
Elementary Socket System Calls
- connect system call: the client establishes a connection with the server

    #include <sys/types.h>
    #include <sys/socket.h>
    int connect(int sockfd, struct sockaddr *servaddr, int addrlen);

- sockfd is a descriptor returned from the socket call
- The second argument points to a sockaddr filled in with the server's address
- The connect call does not return until a connection is negotiated and established
- A connection-oriented client does not have to bind to a local address before calling connect; the local address is assigned automatically
Elementary Socket System Calls
- listen system call: a connection-oriented server indicates to the system its willingness to receive connections

    #include <sys/types.h>
    #include <sys/socket.h>
    int listen(int sockfd, int backlog);

- The call is executed after both the socket and bind calls and immediately before the accept system call
- backlog specifies how many incoming connection requests can be queued while the server is busy handling an accept (usually set to five)
- In concurrent connection-oriented servers, the server needs to accept a request and fork a child process before it can do another accept. This involves a time delay, with possible queue buildup
Elementary Socket System Calls
- accept system call: after a connection-oriented server calls listen, it executes the accept system call

    #include <sys/types.h>
    #include <sys/socket.h>
    int accept(int sockfd, struct sockaddr *peer, int *addrlen);

- accept takes the first request in the queue, creates another socket with the same properties as sockfd, assigns it a new descriptor, and returns this value
- The sockaddr is filled in with the address of the client requesting service
- addrlen is a value-result parameter. It contains the size of the struct sockaddr before the call, and is filled in with the size of the sockaddr that defines the connection request
Elementary Socket System Calls
- send/sendto system calls: similar to write, but require additional arguments

    #include <sys/types.h>
    #include <sys/socket.h>
    int send(int sockfd, char *buff, int nbytes, int flags);
    int sendto(int sockfd, char *buff, int nbytes, int flags,
               struct sockaddr *to, int addrlen);

- The send call sends data into the socket defined by sockfd. The contents of the buffer pointed to by buff, up to nbytes in length, are transmitted. For the sendto call, the sockaddr holds the destination address
- The flags field is either zero or is formed by ORing the following
  - MSG_OOB: send out-of-band data
  - MSG_DONTROUTE: bypass routing (send or sendto)
Elementary Socket System Calls
- recv/recvfrom system calls: similar to read, but require additional arguments

    #include <sys/types.h>
    #include <sys/socket.h>
    int recv(int sockfd, char *buff, int nbytes, int flags);
    int recvfrom(int sockfd, char *buff, int nbytes, int flags,
                 struct sockaddr *from, int *addrlen);

- These calls receive data from the other end. The recv system call is used with connection-oriented clients and servers. recvfrom fills in from and addrlen with the sender's address and its length
- The flags field is either zero or is formed by ORing the following
  - MSG_OOB: receive out-of-band data
  - MSG_PEEK: peek at incoming message (recv or recvfrom)
Elementary Socket System Calls
- close system call: closes the socket

    #include <sys/types.h>
    #include <sys/socket.h>
    int close(int sockfd);

- Sends any queued data if the protocol used by the socket is reliable
- Normally the system tries to return from the close immediately, but the kernel still attempts to send any queued data
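The calls covered so far (socket, bind, listen, connect, accept, send, recv, close) can be tied together in a loopback sketch of an iterative echo server and its client. All function names below are hypothetical. The sketch uses the modern `socklen_t` type where the slides show `int`, binds to port 0 so the kernel picks a free port, reads the chosen port back with `getsockname`, and keeps error handling minimal.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a TCP socket listening on 127.0.0.1 at a kernel-chosen port.
 * Stores the port (host byte order) in *port. Returns the listening
 * descriptor, or -1 on error. */
int listen_on_loopback(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);              /* 0: let the kernel pick */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 5) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Client side: connect to 127.0.0.1:port. Returns the descriptor or -1. */
int connect_to_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Iterative server step: accept one client, read one request, echo it
 * back, and close. Returns the number of bytes echoed, or -1 on error. */
int serve_one_echo(int listenfd)
{
    char buf[256];
    struct sockaddr_in peer;
    socklen_t len = sizeof peer;           /* value-result argument */
    ssize_t n;
    int connfd = accept(listenfd, (struct sockaddr *)&peer, &len);

    if (connfd < 0)
        return -1;
    n = recv(connfd, buf, sizeof buf, 0);
    if (n > 0)
        n = send(connfd, buf, (size_t)n, 0);
    close(connfd);
    return (int)n;
}
```

A real server would call `serve_one_echo` in a loop. Because the kernel queues completed connections up to the backlog, the client's `connect` can succeed before the server reaches `accept`.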
Elementary Socket System Calls
- Connectionless clients can also call the connect system call
- The connect call for UDP is a dummy call that does not send any packets out through the UDP socket
- Local data structures for the destination address get set up with this call
- Once connected, the client can use send and recv to exchange data with the server
- The server address does not have to be supplied each time data is transmitted, as it does with sendto and recvfrom
- The term connect is a misnomer for a UDP client, but the call helps code efficiency
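A connected UDP client can be sketched as follows. The helper name is hypothetical, and the server address is assumed to have been filled in by the caller.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a UDP socket and record `server` as its default destination.
 * No packet is sent by connect() on a datagram socket. Returns the
 * descriptor, or -1 on error. */
int udp_connected_socket(const struct sockaddr_in *server)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (connect(fd, (const struct sockaddr *)server, sizeof *server) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After this call the client can use `send(fd, buf, n, 0)` and `recv(fd, buf, n, 0)` without passing the server address each time, while the server still uses `recvfrom` and `sendto`.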
Elementary Socket System Calls
- getsockname system call: returns the local protocol address associated with a socket

    #include <sys/types.h>
    #include <sys/socket.h>
    int getsockname(int sockfd, struct sockaddr *localaddr, int *addrlen);

- If a connection-oriented client does not call bind, getsockname can be used to return the local IP address and local port number assigned to the connection by the kernel
- After calling bind with a port number of zero, getsockname can be used to get the port allocated by the system for the process
Elementary Socket System Calls
- getpeername system call: returns the foreign protocol address associated with a socket

    #include <sys/types.h>
    #include <sys/socket.h>
    int getpeername(int sockfd, struct sockaddr *peeraddr, int *addrlen);

- If a connection-oriented server calls accept and execs a child process, getpeername is the only way the child process can obtain the client's identity