Title: How the TCP/IP Protocol Works
1How the TCP/IP Protocol Works
- Les Cottrell, SLAC
- Lecture 1 presented at the 26th International Nathiagali Summer College on Physics and Contemporary Needs, 25th June to 14th July, Nathiagali, Pakistan
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2Overview
- This is not a lecture on how to program TCP/IP, rather an introduction to how its major portions work
- IP
- Addressing: IP addresses, ARP, routing
- ICMP
- UDP
- TCP: flow control, error recovery, establishment, disconnect
- References
- Internetworking with TCP/IP, Volume I: Principles, Protocols & Architecture, by Douglas Comer
- TCP/IP Illustrated: The Protocols, by W. Richard Stevens
- Most information also available free via Web searches
3Internet Protocol (IP RFC-791)
TCP/IP Internet provides 3 layers of service
Application services
Reliable transport service
Connectionless packet delivery service
- Layering allows one to replace one service without affecting others
- IP layer (basic unit of transfer in TCP/IP) provides
- Best-effort (does not discard capriciously), unreliable (no guarantees) delivery
- Packets may be lost, duplicated or delivered out of order with no notification
- Connectionless (each packet treated independently)
- IP software provides routing
4Internet datagram
- Basic transfer unit
- Format of Internet datagram
Datagram header
Datagram data area
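Bit positions: 0, 4, 8, 16, 19, 24, 31
Vers (0-3) | Hlen (4-7) | Type of serv. (8-15) | Total length (16-31)
Identification (0-15) | Flags (16-18) | Fragment offset (19-31)
TTL (0-7) | Protocol (8-15) | Header Checksum (16-31)
Source IP address
Destination IP address
IP Options (if any) | Padding
Data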
5IP datagram format (cont.)
- Vers (4 bits): version of IP protocol (IPv4 = 4)
- Hlen (4 bits): header length in 32 bit words; without options (usual case) Hlen = 5, i.e. 20 bytes
- Type of Service, TOS (8 bits): little used in past, now being used for QoS
- Total length (16 bits): length of datagram in bytes, includes header and data
- Time to live, TTL (8 bits): specifies how long datagram is allowed to remain in internet
- Routers decrement by 1
- When TTL = 0 router discards datagram
- Prevents infinite loops
- Protocol (8 bits): specifies the format of the data area
- Protocol numbers administered by central authority to guarantee agreement, e.g. TCP = 6, UDP = 17
6IP Datagram format (cont.)
- Source & destination IP address (32 bits each): contain IP address of sender and intended recipient
- Options (variable length): mainly used to record a route, or timestamps, or specify routing
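The field layout described on the last three slides can be decoded directly from the first 20 bytes of a datagram. The following is a minimal sketch, not a production parser: the function name `parse_ipv4_header` and the sample header values are made up for illustration, and options (Hlen > 5) are ignored.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (vers_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": vers_hlen >> 4,             # Vers: upper 4 bits of first byte
        "hlen_words": vers_hlen & 0x0F,        # Hlen: lower 4 bits, in 32 bit words
        "tos": tos,
        "total_length": total_len,             # header + data, in bytes
        "identification": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset_bytes": (flags_frag & 0x1FFF) * 8,  # stored in 8-byte units
        "ttl": ttl,
        "protocol": proto,                     # e.g. 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }

# Hypothetical sample header: IPv4, Hlen 5, total length 44, DF set, TTL 64, TCP.
# The checksum field is left at 0 here because the sketch does not verify it.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 44, 1, 0x4000, 64, 6, 0,
                     socket.inet_aton("134.79.16.1"), socket.inet_aton("127.0.0.1"))
header = parse_ipv4_header(sample)
print(header["version"], header["hlen_words"], header["protocol"], header["source"])
```

The `!` in the format string selects network (big-endian) byte order, which is how all the multi-byte header fields are transmitted.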
7IP Fragmentation
- How do we send a datagram carrying, say, 1400 bytes of data through a link that has a Maximum Transmission Unit (MTU) of, say, 620 bytes?
- Answer: the datagram is broken into fragments
- Router fragments the 1400 byte datagram
- Into fragments carrying 600 bytes, 600 bytes and 200 bytes of data (note each fragment also needs 20 bytes for its IP header)
- Routers do NOT reassemble, that is up to the end host
Figure labels: Net 1 (MTU = 1500), Net 2 (MTU = 620), Net 3 (MTU = 1500)
8Fragmentation Control
- Identification: copied into each fragment, allows destination to know which fragments belong to which datagram
- Fragment Offset (13 bits): specifies the offset in the original datagram of the data being carried in the fragment
- Measured in units of 8 bytes starting at 0
- Flags (3 bits): control fragmentation
- Reserved (0-th bit)
- Don't Fragment, DF (1st bit)
- useful for simple (computer bootstrap) applications that can't handle fragmentation
- also used for MTU discovery (see later)
- if a router needs to fragment and can't, it discards the datagram & sends an error to the source
- More Fragments (least sig bit): tells receiver whether more fragments follow; when it is 0 the receiver has got the last fragment
- TCP traffic is hardly ever fragmented (due to use of MTU discovery). About 0.1% - 0.5% of TCP packets are fragmented.
9Fragment series composition
Offset = 0, More frags
Offset = 1480, More frags
Offset = 2960, More frags
Offset = 3440, Last frag
NB: if the data segment contains its own header, that header is not replicated in each fragment
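The fragmentation arithmetic from the MTU example earlier (1400 bytes of data over a link with MTU 620) can be sketched as below. This is an illustration only: the function name `fragment` is made up, a 20 byte header without options is assumed, and the 1400 bytes is taken to be the data length.

```python
def fragment(data_len: int, mtu: int, ip_header_len: int = 20):
    """Return (offset_in_8_byte_units, data_bytes, more_fragments) for each fragment."""
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the Fragment Offset field counts in units of 8 bytes.
    max_data = (mtu - ip_header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    sent = 0
    while sent < data_len:
        size = min(max_data, data_len - sent)
        more = sent + size < data_len      # More Fragments flag is 0 only on the last one
        fragments.append((sent // 8, size, more))
        sent += size
    return fragments

frags = fragment(1400, 620)
for offset_units, size, more in frags:
    print(f"offset field={offset_units} (byte {offset_units * 8}), data={size}, more={more}")
```

For these inputs the data is split into 600, 600 and 200 bytes, with offset fields 0, 75 and 150.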
10Internet Addressing
- IP address is a 32 bit integer
- Refers to interface rather than host
- Consists of network and host portions
- Enables routers to keep 1 entry/network instead of 1/host
- Class A, B, C for unicast
- Class D for multicast
- Class E reserved
- Classless addresses
- Written as 4 octets/bytes in decimal format
- E.g. 134.79.16.1, 127.0.0.1
11Internet Class-based addresses
- Class A: large number of hosts, few networks
- 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
- 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
- Initial byte 1-127 (decimal)
- Class B: medium number of hosts and networks
- 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
- 16,384 class B networks, 65,534 hosts/network
- Initial byte 128-191 (decimal)
- Class C: large number of small networks
- 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
- 2,097,152 networks, 254 hosts/network
- Initial byte 192-223 (decimal)
- Class D: 224-239 (decimal), Multicast RFC 1112
- Class E: 240-255 (decimal), Reserved
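Since the class is determined by the leading bits, it can be read off the first octet alone. A small sketch of that lookup, with a made-up function name `address_class` and using the initial-byte ranges listed above (initial byte 0 is simply reported as class A here):

```python
def address_class(address: str) -> str:
    """Return the historical class (A-E) of a dotted-decimal IPv4 address."""
    first_octet = int(address.split(".")[0])
    if not 0 <= first_octet <= 255:
        raise ValueError(f"invalid first octet in {address!r}")
    if first_octet < 128:      # leading bit 0
        return "A"
    if first_octet < 192:      # leading bits 10
        return "B"
    if first_octet < 224:      # leading bits 110
        return "C"
    if first_octet < 240:      # leading bits 1110, multicast
        return "D"
    return "E"                 # leading bits 1111, reserved

for example in ("10.0.0.1", "134.79.16.1", "192.168.1.1", "224.0.0.1", "240.0.0.1"):
    print(example, address_class(example))
```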
12Subnets
- A subnet mask is applied to the host bits to determine how the network is subnetted, e.g. if the host is 137.138.28.228, and the subnet mask is 255.255.255.0, then the right hand 8 bits are for the host (255 is decimal for all bits set in an octet)
- Host addresses of all bits set or no bits set indicate a broadcast, i.e. the packet is sent to all hosts.
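The example above (host 137.138.28.228 with mask 255.255.255.0) can be checked with Python's standard `ipaddress` module. This is just a sketch of the mask arithmetic; the variable names are illustrative.

```python
import ipaddress

# Address and mask taken from the slide's example.
iface = ipaddress.ip_interface("137.138.28.228/255.255.255.0")
network = iface.network

# Host part = address AND (NOT mask); the hostmask is the inverted subnet mask.
host_part = int(iface.ip) & int(network.hostmask)

print("subnet:   ", network)
print("host part:", host_part)
print("broadcast:", network.broadcast_address)  # all host bits set
```

With this mask the right hand 8 bits select the host, so the host part is the last octet of the address.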
13Subnet Mask Conversions
Prefix Length   Subnet Mask
/1              128.0.0.0
/2              192.0.0.0
/3              224.0.0.0
/4              240.0.0.0
/5              248.0.0.0
/6              252.0.0.0
/7              254.0.0.0
/8              255.0.0.0
/9              255.128.0.0
/10             255.192.0.0
/11             255.224.0.0
/12             255.240.0.0
/13             255.248.0.0
/14             255.252.0.0
/15             255.254.0.0
/16             255.255.0.0
/17             255.255.128.0
/18             255.255.192.0
/19             255.255.224.0
/20             255.255.240.0
/21             255.255.248.0
/22             255.255.252.0
/23             255.255.254.0
/24             255.255.255.0
/25             255.255.255.128
/26             255.255.255.192
/27             255.255.255.224
/28             255.255.255.240
/29             255.255.255.248
/30             255.255.255.252
/31             255.255.255.254
/32             255.255.255.255

Decimal Octet   Binary Number
128             1000 0000
192             1100 0000
224             1110 0000
240             1111 0000
248             1111 1000
252             1111 1100
254             1111 1110
255             1111 1111
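The conversion table follows from setting the leftmost N bits of a 32 bit word. A minimal sketch (the function name `prefix_to_mask` is made up for illustration):

```python
def prefix_to_mask(prefix_len: int) -> str:
    """Convert a prefix length (0-32) into a dotted-decimal subnet mask."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("prefix length must be between 0 and 32")
    # Shift in (32 - prefix_len) zero bits on the right, then keep only 32 bits.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))

for n in (8, 16, 21, 24, 30):
    print(f"/{n}", prefix_to_mask(n))
```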
14Address depletion
- In 1991 IAB identified 3 dangers
- Running out of class B addresses
- Increase in nets has resulted in routing table explosion
- Increase in nets/hosts exhausting 32 bit address space
- Four strategies to address
- Creative address space allocation, RFC 2050
- Private addresses RFC 1918, Network Address Translation (NAT) RFC 1631
- Classless InterDomain Routing (CIDR), RFC 1519
- IP version 6 (IPv6), RFC 1883
15Creative IP address allocation
- Class A addresses 64 - 127 reserved
- Handle on individual basis
- Class B only assigned given a demonstrated need
- Class C
- divided up into 8 blocks allocated to regional authorities
- 208-223 remains unassigned and unallocated
- Three main registries handle assignments
- APNIC: Asia & Pacific, www.apnic.net
- ARIN: N. & S. America, Caribbean & sub-Saharan Africa, www.arin.net
- RIPE: Europe and surrounding areas, www.ripe.net
16Private IP Addresses
- IP addresses that are not globally unique, but used exclusively in an organization
- Three ranges
- 10.0.0.0 - 10.255.255.255: a single class A net
- 172.16.0.0 - 172.31.255.255: 16 contiguous class Bs
- 192.168.0.0 - 192.168.255.255: 256 contiguous class Cs
- Connectivity provided by Network Address Translator (NAT)
- translates outgoing private IP address to Internet IP address, and a return Internet IP address to a private address
- Only for TCP/UDP packets
17Classless InterDomain Routing (CIDR)
- Many organizations have > 256 computers but few have more than several thousand
- Instead of giving a class B (16384 nets), give sufficient contiguous class C addresses to satisfy needs
- < 256 addresses: assign 1 class C
- ...
- < 8192 addresses: assign 32 contiguous Class C nets
18CIDR Supernetting
- Since assigned contiguously, a class C CIDR block has the same most significant bits, so it only needs one routing table entry
- CIDR block represented by a prefix and prefix length
- Prefix: single address representing block of nets, e.g.
- 192.32.136.0 = 11000000 00100000 10001000 00000000 while
- 192.32.143.0 = 11000000 00100000 10001111 00000000
- Prefix length indicates number of routing bits, e.g.
- 192.32.136.0/21 means 21 bits used for routing
- CIDR collects all nets in range 192.32.136.0 through 192.32.143.0 into a single router entry, which reduces router table entries
- Removes address class A, B & C boundaries
- For more details see RFC 1519
Figure label: 21 bit prefix (2048 host addresses)
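The aggregation in this example can be reproduced with the standard `ipaddress` module, which can collapse contiguous networks into the smallest covering set of prefixes. A short sketch using the eight class C sized nets 192.32.136.0 through 192.32.143.0 from the slide:

```python
import ipaddress

# The eight contiguous /24 ("class C" sized) networks from the example.
class_c_nets = [ipaddress.ip_network(f"192.32.{third}.0/24") for third in range(136, 144)]

# collapse_addresses merges adjacent networks that share a common prefix.
aggregated = list(ipaddress.collapse_addresses(class_c_nets))
block = aggregated[0]

print(aggregated)            # a single routing entry
print(block.prefixlen)       # number of routing bits
print(block.num_addresses)   # addresses covered by the block
```

The merge works here because 136 is a multiple of 8, so all eight nets share their first 21 bits.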
19Address Resolution Protocol (ARP)
- IP address is at network layer, need to map it to the link layer MAC (Ethernet) address
- Use ARP to map 32 bit IP address to 48 bit Ethernet address
- IP requests MAC address for IP address from local ARP table
- If not there, then an ARP request packet for the IP address is sent using the physical broadcast address (all Fs)
- Host with requested IP address responds with its MAC address as a unicast packet
- On return, host updates ARP table and returns MAC address
- ARP cache entries time out
- ARP packets are carried directly on top of Ethernet
20ARP cont.
- ARP requests are local only, do not cross routers
- Compare local IP and subnet mask => local subnet
- Compare local subnet to destination IP
- if local, ARP for MAC address
- else remote so
- if ROUTE entry, ARP for router to subnet
- if default route, ARP for default gateway
- otherwise, drop packet & return error
Figure: User A and User B on two subnets (Subnet 1, Subnet 2) joined by a router; addresses shown are 134.79.10.17, 134.79.15.3, 134.79.15.1 and 134.79.10.1
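The decision steps above (local subnet, ROUTE entry, default route, otherwise drop) can be sketched as a small function that returns the IP address to ARP for. The function name `arp_target`, the route table layout and the /24 mask are assumptions made for illustration; the addresses are those shown in the figure.

```python
import ipaddress

def arp_target(local_if: str, destination: str, routes: dict, default_gateway=None):
    """Return the IP address to ARP for, or None if the packet must be dropped.

    local_if is "address/mask"; routes maps a network prefix to a router address.
    """
    local = ipaddress.ip_interface(local_if)
    dest = ipaddress.ip_address(destination)
    if dest in local.network:                      # destination on local subnet
        return str(dest)
    for net, router in routes.items():             # explicit ROUTE entry
        if dest in ipaddress.ip_network(net):
            return router
    return default_gateway                         # default route, or None => drop

# Hypothetical configuration: a /24 mask is assumed for both subnets.
print(arp_target("134.79.10.17/24", "134.79.10.99", {}, "134.79.10.1"))
print(arp_target("134.79.10.17/24", "134.79.15.3", {"134.79.15.0/24": "134.79.10.1"}))
print(arp_target("134.79.10.17/24", "210.56.16.10", {}, "134.79.10.1"))
print(arp_target("134.79.10.17/24", "210.56.16.10", {}))
```

A real IP stack picks the longest matching prefix when several routes match; this sketch simply takes the first match.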
21Routing
- Routers must select next hop for packet
- Get route information from other routers via a routing protocol (RIP, OSPF, EIGRP etc.)
- Note the following are non-routable
- Private networks 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
- Loopback 127.0.0.0/8
22ICMP Purpose (RFC 792)
- Communicates control & error information
- Between routers and hosts
- Only reports to original source, suggests corrections
- Error messages about error messages are not generated
- Never generated due to multicasts
- Packet format
Bit positions: 0, 8, 16, 24, 31
Type (0-7) | Code (8-15) | Checksum (16-31)
ICMP data (depends on type/code)
23Main ICMP request types
24ICMP Echo/Ping
- Very commonly used diagnostic tool
- Implementations vary between OS
- Build echo request
- Identifier used to match request to replies (e.g. pid)
- Sequence number, starts at 0, increments by 1 for each ping packet
- Used to detect loss, reorder, duplicates
- Optional data, sent by requester, returned by replier
- Usually contains a timestamp of when the request was sent plus pad data
Bit positions: 0, 8, 16, 24, 31
Type = 8 | Code = 0 | Checksum
Identifier | Sequence number
Optional data
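An echo request with this layout can be assembled in a few lines. The sketch below only builds the bytes and computes the standard Internet checksum (16 bit one's complement sum); it does not send anything, since raw ICMP sockets normally need special privileges. The function names and the identifier/payload values are illustrative.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16 bit one's complement of the one's complement sum of 16 bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad to an even number of bytes
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo request: Type = 8, Code = 0, checksum, id, sequence, data."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)  # checksum 0 while summing
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload

packet = build_echo_request(identifier=0x1234, sequence=0, payload=b"ping")
print(packet.hex())
```

A receiver validates the message by running the same checksum over the whole packet, which gives 0 when nothing was corrupted.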
25What do we learn from Ping
- Host reachable
- Host may respond to ping but not be running services
- Round trip timing
- Lost packets
- Packet reordering & duplicate packets
- Example
13cottrell@noric05> ping -c 4 lhr.comsats.net.pk
PING lhr.comsats.net.pk (210.56.16.10) from 134.79.125.205 : 56(84) bytes of data.
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=0 ttl=242 time=716.962 msec
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=1 ttl=242 time=720.375 msec
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=2 ttl=242 time=725.907 msec
64 bytes from lhr.comsats.net.pk (210.56.16.10): icmp_seq=3 ttl=242 time=710.734 msec
--- lhr.comsats.net.pk ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/mdev = 710.734/718.494/725.907/5.566 ms
26Unreachable
76cottrell@flora06> ping islamabad-server2.comsats.net.pk
ICMP 13 Unreachable from gateway 207.45.205.18
for icmp from FLORA06.SLAC.Stanford.EDU (134.79.16.101) to islamabad-server2.comsats.net.pk (210.56.8.8)
What does this mean? See exercise.
27Time Exceeded
- Time-to-live has expired at a router (code = 0)
- TTL sets bound on number of routers a datagram can transit
- Prevents infinite routing loops
- Initialized by sender, decremented by 1 each time it passes a router
- When TTL = 0 datagram is thrown away & sender notified by ICMP message
- Fragment reassembly timer expired (code = 1)
Bit positions: 0, 8, 16, 24, 31
Type = 11 | Code | Checksum
Unused
Internet header + 8 bytes of data
28MTU Discovery
- Path MTUs vary
- Fragmentation is bad
- Small transmission units are bad
- So need to discover optimum MTU (largest without fragmentation)
- Host sends a packet with the Don't Fragment bit set
- Length is lesser of local MTU and MSS announced by remote system
- If MTU between hosts requires fragmentation (e.g. at an intermediate router), then
- since the DF bit is set the router must not fragment, so an ICMP message is sent back to the source, saying "I can't fragment"; the source tries again with a smaller size.
29User Datagram Protocol - UDP
- RFC 768, Protocol 17
- Provides unreliable, connectionless delivery on top of IP
- Minimal overhead, high performance
- No setup/teardown, 1 datagram at a time
- Application responsible for reliability
- Includes datagram loss, duplication, delay, out-of-sequence, multiplexing, loss of connectivity
Figure: layering. App. layer: Port 1, Port 2 above each of UDP and TCP (demux on port number). Transport layer: UDP, TCP (demux on IP protocol). Network layer: IP.
30UDP Datagram format
- Source/destination port: port numbers identify sending & receiving processes
- Port number & IP address allow any application in any computer on Internet to be uniquely identified
- Used to demultiplex datagrams to processes
- Ports can be static or dynamic
- Static (< 1024) assigned centrally, known as well known ports
- Dynamic
- Message length in bytes includes the UDP header and data
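The slide names the ports and the message length; RFC 768 adds a 16 bit checksum, giving an 8 byte header. The sketch below packs and unpacks that header. The function names and port numbers are made up for illustration, and the checksum is left at 0 (which in UDP over IPv4 means "not computed") to keep the example short.

```python
import struct

UDP_HEADER_LEN = 8  # source port, destination port, length, checksum: 16 bits each

def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix payload with a UDP header; the checksum is left as 0 in this sketch."""
    length = UDP_HEADER_LEN + len(payload)      # length covers header AND data
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + payload

def parse_udp_datagram(datagram: bytes):
    """Return (source port, destination port, length, payload)."""
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    return src_port, dst_port, length, datagram[UDP_HEADER_LEN:length]

# Hypothetical ports: a dynamic client port sending to a well known port (< 1024).
datagram = build_udp_datagram(40000, 53, b"query")
print(parse_udp_datagram(datagram))
```

The destination port is what the receiving host uses to demultiplex the datagram to the right process.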
31UDP applications
- Message oriented, e.g. SNMP, DNS, time
- File system, e.g. NFS, AFS
- Lightweight file transfer, e.g. tftp, bootp
32Transmission Control Protocol -TCP
- RFC 793, host requirements RFC 1122
- Reliable stream transport
- Connection oriented (full duplex virtual circuit)
- Conceptually place call, two ends communicate to agree on details
- After agreeing, application notified of connection
- During transfer, ends communicate continuously to verify data received correctly
- When done, ends tear down the connection
- If UDP is like regular mail, TCP is like a phone call
- Provides buffering and flow control
- Takes care of lost packets, out of order, duplicates, long delays
- Isolates application program from network details
- Jargon
- Segment: TCP packet
- Socket: source (address, port) & destination (address, port)
33TCP layering
- To ID connection need
- Source (address, port) AND Destination (address, port)
- Only need one port on host to allow multiple connections, since each connection will have different (host, port) at other end
- E.g. single host can serve multiple telnet connections
- Passive open: application contacts OS & indicates it will accept incoming connection, OS assigns port and listens
- Active open: application requests OS to connect to a (host, port)
Figure: layering. App. layer: Port 1, Port 2 above each of UDP and TCP (demux on port number). Transport layer: UDP, TCP (demux on IP protocol; TCP is IP protocol 6). Network layer: IP.
34TCP providing reliability
- Positive acknowledgement (ACK) with retransmission
- Sender keeps record of each packet sent
- Sender awaits an ACK
- Sender starts timer when it sends a packet
Figure (network messages between sender site and receiver site, time runs downward):
Sender: send pkt 1
Receiver: rcv pkt 1, send ACK 1
Sender: rcv ACK 1, send pkt 2
Receiver: rcv pkt 2, send ACK 2
Sender: rcv ACK 2
35TCP simple lost packet recovery
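Figure (network messages between sender site and receiver site, time runs downward):
Sender: send pkt 1, start timer (pkt is lost in the network)
Receiver: pkt should arrive, ACK should be sent
Sender: ACK normally arrives; instead timer expires, retransmit pkt 1, start timer
Receiver: rcv pkt 1, send ACK 1
Sender: rcv ACK 1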
36TCP improving performance
- BUT simple ACK protocol wastes bandwidth since it must delay sending next packet until it gets ACK
- Use sliding window
- With a window of 4, sender can send 4 packets of data without waiting for an ACK
- When sender gets an ACK it can send another packet
- Window = unacknowledged packets/bytes
- Keeps timer for each packet
Figure: initial window of 4 packets; the window slides as ACKs arrive. Packets are shown as successfully sent, sent and awaiting ACK, or still to be sent.
37Tuning to fill pipe
- Optimal window size depends on
- Bandwidth end to end, i.e. min(BWlinks), AKA bottleneck bandwidth
- Round Trip Time (RTT)
- For TCP keep pipe full
- Window (sometimes called pipe) = RTT * BW
- Tuning the window can increase throughput by orders of magnitude
- Windows also used for flow control
Figure: pipe between Src and Rcv
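The window needed to keep the pipe full is the bandwidth-delay product, and a window smaller than that caps throughput at window/RTT. A small sketch of both calculations follows; the link speed, RTT and window size in the example are hypothetical, not measurements from this lecture.

```python
def window_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Window needed to fill the pipe: RTT * bottleneck bandwidth, in bytes."""
    return bandwidth_bits_per_s * rtt_s / 8

def max_throughput_bits_per_s(window_size_bytes: float, rtt_s: float) -> float:
    """Best possible throughput when at most one window is sent per round trip."""
    return window_size_bytes * 8 / rtt_s

# Hypothetical path: 100 Mbit/s bottleneck, 150 ms round trip time.
needed = window_bytes(100e6, 0.150)
print(f"window to fill pipe: {needed / 1e6:.2f} MBytes")

# Hypothetical untuned 64 kByte window on the same path.
limit = max_throughput_bits_per_s(64 * 1024, 0.150)
print(f"throughput limit with 64 kByte window: {limit / 1e6:.2f} Mbit/s")
```

On such a path the small window limits the transfer to a few percent of the link speed, which is why window tuning matters on long-distance links.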
38Implementation
- Sliding window operates at byte level, NOT packet level
- Receiver keeps similar window to put stream back together
- Since full duplex, altogether 4 windows/pointer sets
Figure: current window delimited by 3 pointers: bytes sent and acknowledged, highest byte sent, highest byte that can be sent
39TCP flow control
- Windows vary over time
- Receiver advertises (in ACKs) how many bytes it can receive
- Based on buffers etc. available
- Sender adjusts its window to match advertisement
- If receiver buffers fill, it sends smaller adverts
- Used to match buffer requirements of receiver
- Also used to address congestion control (e.g. in intermediate routers)
40TCP Segment format
- Source/Dest port: TCP port numbers to ID applications at both ends of connection
- Sequence number: ID position in sender's byte stream
Bit positions: 0, 4, 10, 16, 24, 31
Source port (0-15) | Destination port (16-31)
Sequence number
Acknowledgement number
Hlen (0-3) | Resv (4-9) | Code (10-15) | Window (16-31)
Checksum (0-15) | Urgent ptr (16-31)
Options (if any) | Padding
Data (if any)
41TCP segment format cont.
- Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
- Hlen: specifies the length of the segment header in 32 bit multiples. If there are no options, Hlen = 5 (20 bytes)
- Reserved: for future use, set to 0
- Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
42TCP Segment format- cont
- Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
- Checksum: verifies the integrity of the TCP header and data. It is mandatory.
- Urgent pointer: used with the URG flag to indicate where the urgent data starts in the data stream. Typically used with a file transfer abort during FTP or when pressing an interrupt key in telnet.
- Options: used for window scaling, SACK, timestamps, maximum segment size etc.
43TCP timeout
- Need a timeout estimate that will work for LANs (RTT < 1 msec) to satellite WANs (hundreds of msec to secs). RTT can vary a lot with time of day, day of week, or from one second to the next.
- TCP records time segment sent
- and time ACK received
- Then calculates RTT sample
- Smooths the samples (RTTs) & uses them to estimate timeout, e.g.
- Timeout = beta * RTTs
- Timeout = RTTs + eta * f(dev(RTTs)), with eta = 4
- Needs to take account of losses, e.g.
- New_timeout = gamma * timeout, with gamma = 2
Figure: RTT (ms) vs. time of day, May 12th
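One common way to turn RTT samples into a timeout is to keep a smoothed RTT and a smoothed deviation, set the timeout to the smoothed RTT plus a multiple of the deviation, and multiply the timeout by a backoff factor after a loss. The sketch below follows that approach. The class name, the smoothing gains (1/8 and 1/4), the use of mean deviation for f(dev), and the values 4 and 2 for the two multipliers are assumptions chosen for illustration, not values taken from this lecture.

```python
class RttEstimator:
    """Smoothed RTT and deviation, with multiplicative backoff on loss."""

    def __init__(self, gain_rtt=0.125, gain_dev=0.25, eta=4.0, gamma=2.0):
        self.gain_rtt = gain_rtt    # weight of a new sample in the smoothed RTT (assumed)
        self.gain_dev = gain_dev    # weight of a new sample in the deviation (assumed)
        self.eta = eta              # deviation multiplier in the timeout (assumed 4)
        self.gamma = gamma          # backoff factor applied after a loss (assumed 2)
        self.srtt = None
        self.dev = 0.0
        self.timeout = None

    def on_sample(self, rtt: float) -> float:
        """Update the estimate with a new RTT sample and return the timeout."""
        if self.srtt is None:                 # first sample initialises the state
            self.srtt = rtt
            self.dev = rtt / 2
        else:
            self.dev = (1 - self.gain_dev) * self.dev + self.gain_dev * abs(rtt - self.srtt)
            self.srtt = (1 - self.gain_rtt) * self.srtt + self.gain_rtt * rtt
        self.timeout = self.srtt + self.eta * self.dev
        return self.timeout

    def on_loss(self) -> float:
        """Back off the timeout after a retransmission timer expires."""
        if self.timeout is None:
            raise RuntimeError("no RTT sample yet")
        self.timeout *= self.gamma
        return self.timeout

est = RttEstimator()
for sample_ms in (100.0, 100.0, 180.0):
    print(f"sample={sample_ms} ms -> timeout={est.on_sample(sample_ms):.1f} ms")
print(f"after loss -> timeout={est.on_loss():.1f} ms")
```

Steady samples shrink the deviation term and tighten the timeout, while a sudden jump in RTT widens it again.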
44TCP connection establishment
- 3 way handshake
- Initial sequence numbers (x, y) are chosen randomly
- Guarantees both sides are ready & know it, and sets initial sequence numbers, also sets window & MSS
- Once connection established, data can flow in both directions, equally well, there is no master or slave
Figure (time runs downward):
Site 1 (active, win 4096, mss 1024): send SYN, seq = x
Site 2 (passive, win 4096, mss 1024): rcv SYN segment; send SYN seq = y, ACK x+1
Site 1: rcv SYN/ACK; send ACK y+1
Site 2: rcv ACK segment
45TCP close connection
- Modified 3 way handshake (or 4 way termination)
- App tells TCP to close, TCP sends remaining data & waits for ACK, then sends FIN
- Site 2 TCP ACKs FIN, tells its application end of data
- Site 2 sends FIN when its app closes connection (may be a long delay, e.g. if it requires human interaction)
Figure (time runs downward):
Site 1: (app closes) send FIN, seq = x
Site 2: rcv FIN segment; send ACK x+1 (inform app)
Site 1: rcv ACK segment
Site 2: (app closes connection) send FIN seq = y, ACK x+1
Site 1: rcv FIN + ACK seg; send ACK y+1
Site 2: receive ACK segment
46More Information
- Lectures, tutorials etc.
- www.nv.cc.va.us/home/joney/tcp_ip.htm
- www.cs.pdx.edu/jrb/tcpip.lectures.html
- www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
- www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
- www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
- www.jbmelectronics.com/tcp.htm
- Encyclopaedia
- http://www.freesoft.org/CIE/index.htm
- TCP/IP Resources
- www.private.org.il/tcpip_rl.html
- Understanding IP addresses
- http://www.3com.com/solutions/en_US/ncs/501302.html
- Configuring TCP (RFC 1122)
- ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
- Assigned protocols, ports etc. (RFC 1010)
- http://www.es.net/pub/rfcs/rfc1010.txt
- /etc/protocols
47Example 3 way handshake
- atlas> telnet sunstats.cern.ch
- atlas is a WNT PC, sunstats is a Sun Solaris 5.6 host
- MSS is set in a TCP option in a SYN segment, communicates the MSS the sender wants to receive
- len = ip_hlen/tcp_hlen/ip_total_len
- Initial Sequence Numbers are randomly selected
- Telnet port = 23
- W = Receive window size, advertises how much data this host will accept
48Example 3 way handshake - cont.
- TCP from atlas:1174 to sunstats:23 seq=180839, A=0, W=8192, SYN, len=5/6/44, opt=020405B4 <opt=2, len=4, mss=0x5B4=1460>
- TCP from sunstats:23 to atlas:1174 seq=1383568304, A=180840, W=64240, SYN/ACK, len=5/6/44, opt=020405B4
- TCP from atlas:1174 to sunstats:23 seq=180840, A=1383568305, W=8760, len=5/5/40, opt=nul
- Notice window size can vary from segment to segment depending on buffer space available
- Notice smaller PC window advertisement
- Notice ephemeral port selected by telnet client
- Notice acknowledge next expected byte (seq + 1)
- 0x020405B4: 02 = option type, 04 = len, 0x5B4 = 1460
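The option bytes and the acknowledgement numbers in this trace can be checked mechanically. The sketch below walks the TCP option list (kind, length, value), extracts the MSS, and confirms that each ACK is the peer's SYN sequence number plus one. The function name `parse_tcp_options` is made up, and only the MSS option is decoded.

```python
def parse_tcp_options(options: bytes) -> dict:
    """Walk a TCP option list and return the options this sketch understands."""
    parsed = {}
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                        # end of option list
            break
        if kind == 1:                        # no-operation (padding), single byte
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated option")
        length = options[i + 1]              # covers kind, length and value bytes
        if length < 2:
            raise ValueError("bad option length")
        value = options[i + 2:i + length]
        if kind == 2 and length == 4:        # maximum segment size
            parsed["mss"] = int.from_bytes(value, "big")
        i += length
    return parsed

# Option bytes from the SYN segments in the trace above.
opts = parse_tcp_options(bytes.fromhex("020405B4"))
print(opts)

# A SYN consumes one sequence number, so each side ACKs the peer's seq + 1.
client_isn, server_isn = 180839, 1383568304
print((client_isn + 1) % 2**32, (server_isn + 1) % 2**32)
```

The modulo keeps the arithmetic correct if an initial sequence number happens to sit at the top of the 32 bit range.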
49Session start
SLAC => CERN, 256 kbyte window, 1 stream, full speed > 30 msec, 13 MBytes in 20 s, 5.1 MBytes/s
Figure labels: congestion window; receiver advertised window; segments sent; ACKs returned by receiver