Low Latency Networking - PowerPoint PPT Presentation

1
Low Latency Networking
  • Glenford Mapp
  • Digital Technology Group
  • Computer Laboratory
  • http://www.cl.cam.ac.uk/Research/DTG/gem11

2
What is Latency?
  • The time taken to send a unit of data between two
    points in a network
  • A low latency network is one in which the
    hardware, systems and protocols are designed to
    minimize the time taken to move units of data
    between any two points on that network

3
Throughput
  • The number of bytes of data transferred per
    second between two points
  • Doesn't high throughput imply low latency?
  • Not necessarily
  • A bus vs a car travelling along a section of road
  • Which has the higher throughput?
  • Which has the lower latency?

4
Throughput vs Latency
  • In its simplest form,
  • Throughput = C / Latency
  • C = instantaneous capacity, the number of units
    handled per operation
  • So if C is large you can get good throughput even
    if your latency is not low
  • Low latency does not necessarily imply high
    throughput if C also gets smaller
  • ATM is a good example: its fixed 53-byte cells
    keep per-unit latency low but make C small (see
    the sketch below)
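  • As a worked example (a minimal sketch in C; the
    10 µs per-operation latency is an assumed figure,
    not from the slides), compare a small fixed unit
    (ATM-style) with a large unit (Ethernet-style):

    /* throughput = C / latency, C = bytes handled per operation */
    #include <stdio.h>

    int main(void) {
        double atm_c = 53.0;    /* bytes per ATM cell           */
        double eth_c = 1500.0;  /* bytes per max Ethernet frame */
        double lat   = 10e-6;   /* assumed 10 us per operation  */

        /* Same per-operation latency, very different throughput: */
        printf("ATM:      %.1f MB/s\n", atm_c / lat / 1e6);  /* ~5.3 */
        printf("Ethernet: %.1f MB/s\n", eth_c / lat / 1e6);  /* ~150 */
        return 0;
    }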

5
Throughput Claims
  • Look carefully at high throughput claims.
  • Have they decreased the latency?
  • Per-unit operation is faster
  • Software -> Hardware (ATM)
  • Have they increased instantaneous capacity?
  • Serial -> Parallel, Parallel -> Serial
  • In most designs we have a mixture of both
  • Manufacturers will generally allow increased
    latency if capacity greatly increases

6
Who cares about latency?
  • Why is latency important?
  • Some applications are more affected by latency
    than by throughput
  • Voice
  • Also affected by jitter
  • Networked Games
  • Interactive sessions

7
Lessons from Computers
  • Consider the mainframe in the time-sharing era
    (1963-1976)
  • Studies showed that user productivity fell by
    half when the response time of the mainframe
    increased from 0.5 to 3 seconds
  • Mainframe optimised for throughput
  • Maximize the number of people using it
  • High throughput

8
Lessons from Computers
  • But the more people logged on, the slower the
    machine became; by noon the response time would
    increase markedly and user productivity would
    fall
  • Key factor in the development of PCs
  • Famous saying
  • "I love the Alto (first PC) because it does not
    run faster at night!"

9
A look at the Internet
  • Not really designed for low latency
  • Designed to be adaptable and robust
  • But the new applications we want the Internet to
    support need low latency
  • Web servers
  • Voice over IP
  • Networked Games, etc

10
Components of Network Latency
  • Hardware
  • Different hardware capacities and limitations
  • Ethernet: variable packet size, max 1500 bytes
  • ATM: fixed cells of 53 bytes
  • Network Routers and Switches
  • Queueing strategies
  • Overload/congestion strategy

11
Components of Network Latency
  • System Latency
  • Moving the packet between the application and the
    network interface
  • OS latency
  • The operating system handling the packet
  • Application Latency
  • Application must acquire resources (e.g. CPU) in
    order to send or consume data

12
Traditional Networking A closer look
  • Look at a packet being received by the host
    machine and delivered up to the application
  • At the lowest level, the packet enters the
    network interface card (NIC) and ends up in a
    buffer or FIFO on the card. The card then
    generates an interrupt.

13
Traditional Networking cont'd
  • Interrupt Handler runs, data is moved into a
    system buffer in main memory.
  • Packet is placed on a receive queue
  • In Linux there is one network receive queue
  • Packets from all the network interfaces are
    placed on that queue
  • Packet is marked for system processing
  • Interrupt Handler ends

14
Traditional Networking cont'd
  • System processing
  • Packet is taken up the protocol stack
  • IP processing, then TCP processing
  • Connection information associated with the packet
    is used to find the corresponding socket
  • Socket = (Src (IP addr, TCP port), Dest (IP
    addr, TCP port)), as sketched below
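  • A minimal sketch of that lookup in C, assuming a
    simple hash table keyed on the 4-tuple (the names
    and the hash are illustrative; the real kernel
    structures differ):

    #include <stdint.h>
    #include <stddef.h>

    /* The 4-tuple that identifies a TCP connection. */
    struct sock_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    struct socket {
        struct sock_key key;
        struct socket  *next;  /* hash-chain link */
        /* ... receive queue, waiting threads, ... */
    };

    #define NBUCKETS 1024
    static struct socket *buckets[NBUCKETS];

    /* Demultiplex: hash the 4-tuple taken from the IP/TCP headers
     * and search that bucket for an exact match. */
    static struct socket *demux(const struct sock_key *k) {
        uint32_t h = (k->src_ip ^ k->dst_ip ^
                      ((uint32_t)k->src_port << 16 | k->dst_port))
                     % NBUCKETS;
        for (struct socket *s = buckets[h]; s; s = s->next)
            if (s->key.src_ip == k->src_ip &&
                s->key.dst_ip == k->dst_ip &&
                s->key.src_port == k->src_port &&
                s->key.dst_port == k->dst_port)
                return s;
        return NULL;  /* no matching connection */
    }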

15
Traditional Networking contd
  • Queue the packet on the socket structure and see
    if any application threads are waiting for
    incoming data
  • If so, copy the data from the system buffer to
    the user buffer and wake up the thread
  • The application then has to wait until it gets
    the CPU to consume the data (see the sketch
    below)
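  • A condensed sketch of this last hop in C, with a
    pthread condition variable standing in for the
    kernel's sleep/wakeup machinery (the structure
    and names are illustrative, not the real kernel
    path):

    #include <pthread.h>
    #include <string.h>

    struct rx_socket {
        pthread_mutex_t lock;      /* PTHREAD_MUTEX_INITIALIZER     */
        pthread_cond_t  readable;  /* application thread sleeps here */
        char            buf[2048]; /* queued packet payload          */
        size_t          len;       /* 0 means "nothing queued"       */
    };

    /* Protocol-stack side: queue the data, wake any waiting reader. */
    void deliver(struct rx_socket *s, const char *data, size_t n) {
        pthread_mutex_lock(&s->lock);
        memcpy(s->buf, data, n);           /* copy into socket queue */
        s->len = n;
        pthread_cond_signal(&s->readable); /* wake the blocked thread */
        pthread_mutex_unlock(&s->lock);
    }

    /* Application side: sleep until data arrives, then copy it out;
     * the thread still has to win the CPU before this code runs. */
    size_t consume(struct rx_socket *s, char *user_buf) {
        pthread_mutex_lock(&s->lock);
        while (s->len == 0)
            pthread_cond_wait(&s->readable, &s->lock);
        size_t n = s->len;
        memcpy(user_buf, s->buf, n);       /* copy to the user buffer */
        s->len = 0;
        pthread_mutex_unlock(&s->lock);
        return n;
    }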

16
Analysis of Traditional Networking
  • Interrupt systems: potentially unbounded latency
  • Processing of packets in the queue is affected by
    the rate of incoming packets
  • Copying data adds to latency
  • OS sits between two worlds
  • It de-multiplexes the packet and decides its
    final destination
  • It also ensures that the relevant application is
    scheduled to receive the data. This is called
    application synchronisation

17
[Diagram: the receive path: Network -> NIC -> System Buffers -> Socket layer in OS -> Socket Interface -> Application Layer]
18
Cross Talk Issues
  • Interrupt level
  • while an application is running on the processor,
    network interrupts occur on incoming packets for
    other processes.
  • Protocol level
  • packets for all applications are multiplexed and
    de-multiplexed in the kernel
  • Application Level
  • All applications must share resources so
    sometimes I must wait a long time before I get
    the processor.

19
Some ways to improve Traditional Networking
  • User level network interfaces
  • UNET - Matt Welsh (1995-1998)
  • Zero copy architectures
  • Virtual memory mapping techniques
  • Vertical Partitioning of Operating Systems

20
UNET
  • Application has an interface to talk directly to
    a network device
  • Doesn't involve the kernel in things like
    protocol processing, etc.
  • Uses per-application message queues to send and
    receive data
  • Novel idea at the time
  • complicates what applications need to do

21
UNET Endpoint
[Diagram: a UNET endpoint: a communication segment with send, free and receive queues]
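  • A rough C sketch of such an endpoint, assuming
    simple descriptor rings (the real UNET descriptor
    formats differ):

    #include <stdint.h>

    #define RING 64

    /* A descriptor points into the shared communication segment. */
    struct unet_desc {
        uint32_t offset;  /* where the message lives in the segment */
        uint32_t length;
    };

    /* Per-application endpoint: the application and the NIC fill
     * and drain these queues directly, with no kernel on the data
     * path (the kernel is only involved in endpoint setup). */
    struct unet_endpoint {
        uint8_t         *comm_segment;  /* pinned, card-visible memory */
        struct unet_desc send_q[RING];  /* app fills, NIC drains       */
        struct unet_desc recv_q[RING];  /* NIC fills, app drains       */
        struct unet_desc free_q[RING];  /* empty buffers for the NIC   */
        uint32_t send_head, send_tail;
        uint32_t recv_head, recv_tail;
        uint32_t free_head, free_tail;
    };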
22
Zero-Copy Architecture
  • No need to copy data up to the application
  • DMA from network buffers on the NIC straight
    into system buffers
  • Use VM techniques to map the relevant system
    buffers into the address space of the application

23
Vertical Partitioning of the OS
  • UNET gave applications an abstract network card,
    so there was less multiplexing of data
  • Why not go all the way and do more partitioning
    of OS resources?
  • So the CPU is carefully partitioned, and file
    systems and disk devices are also carefully
    partitioned

24
Pegasus project - Cambridge
  • Studied system support for multimedia
    applications
  • Developed a new operating system called Nemesis
    which adopted a vertical approach
  • Most of the operating system functions were in
    shared libraries which executed in the user's
    process space
  • System-wide page table, so no copying

25
Vertical Approach
[Diagram: the vertical approach: processes execute OS functions in shared libraries within their own address spaces, rather than calling into a normal OS]
26
Why haven't these ideas been universally
implemented?
  • Some were explored
  • VIA is a hardware idea based on UNET
  • Replace PCI bus
  • Devices have receive, send and completion queues
    and are connected along a high-speed serial bus
  • One or two products came out but fell out of
    favour
  • InfiniBand - now a popular extension of VIA

27
Ideas not universal
  • Zero-copy and VM ideas were explored in some
    operating systems, e.g. Sun's Spring OS. Some
    ideas made their way into Solaris, and into
    Windows 2000 and XP via Mach and NT
  • Nemesis was too radical for prime time
  • QoS ideas have been taken up by others

28
But the real reason was...
  • That processor and network speeds have been
    increasing fast enough to keep traditional
    networking in the picture.
  • If you simply want to browse the Web and read
    email, then it is OK
  • However, there is a looming problem

29
Network speeds still going up!
  • We have gone from 10 Mbit/s in 1987 to 10 Gbit/s
    in 2004 and beyond
  • The processor may not be able to keep up
  • Interrupt rate is phenomenal
  • Buses like the PCI bus cannot keep up
  • Move to PCI Express (Switch Fabric)
  • A workstation can presently saturate the
    network, but the tide is rapidly turning!
  • Network traffic will soon be able to cripple your
    PC

30
Need a system that is less interrupt-dependent
  • Two main approaches
  • No OS processing whatsoever
  • including no interrupts
  • data is moved by hardware
  • The OS is only used to set up where the data is
    moved to
  • Apply more processing power but target it on the
    network interface

31
Shared Memory Model
  • Data transfer is accomplished by writing to
    memory addresses in the local address space of
    the process
  • This data is captured by the local network card
    and serialized into packets, which are
    transferred over the network to the remote
    machine, which writes the data to the
    corresponding remote addresses

32
How does it actually work?
  • A region of the local address space of the
    process is mapped to an I/O region on the card.
    That mapping is usually made using standard
    memory-mapping techniques.
  • In Unix the mmap call is used (see the sketch
    below)
  • Same thing is done on the remote side
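  • A minimal sketch of that call in C, assuming the
    card's driver exposes its I/O region through a
    hypothetical device node /dev/smm0:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/smm0", O_RDWR);  /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }

        /* Map 64 KB of the card's I/O region into our address space. */
        size_t len = 64 * 1024;
        volatile char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        /* An ordinary store now becomes a network transfer: the card
         * captures the write and forwards it to the remote region. */
        region[0] = 42;

        munmap((void *)region, len);
        close(fd);
        return 0;
    }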

33
Shared Memory Model
[Diagram: shared memory model: each process's VM region is mapped to its NIC, and packets carry the writes between the two NICs]
34
How is the association between the local and
remote regions made?
  • Fixed
  • In early SMMs, it was fixed.
  • All processors on the network share the same
    region.
  • Flexible
  • Needs a communications channel to set up the
    mapping between regions

35
Fixed SMM
[Diagram: fixed SMM: processes A-D all map the same shared region of the VM space]
36
Dynamic SMM
[Diagram: dynamic SMM: regions of the VM space are mapped pairwise between processes as needed]
37
SMM
  • Been around a long time
  • Used to communicate between processors in a
    cluster.
  • The SMM is divided into pages, some of which can
    be mapped between two processes while others can
    be mapped globally

38
Problems with SMM
  • Since no interrupts are involved and the OS is
    no longer in the loop, it's hard to inform the
    remote node that data has been sent and is
    waiting to be read
  • Major problem is therefore not the transfer, but
    application synchronization

39
Application Synchronization Solutions
  • Polling
  • the receiver keeps polling certain addresses to
    see if a data transfer has occurred
  • This is expensive (wasting local CPU) and only
    relevant if there is a real chance of a data
    transfer.
  • Could be used to provide a form of distributed
    synchronization: spinning on a remote address
    (see the sketch below)
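  • A minimal sketch of the spinning receiver in C,
    assuming the sender writes a nonzero sequence
    number to an agreed flag word in the shared
    region:

    #include <stdint.h>

    /* Spin until the remote side writes a nonzero value to the flag.
     * volatile forces a real memory load on every iteration; the CPU
     * is burned the whole time the receiver waits. */
    uint32_t wait_for_transfer(volatile uint32_t *flag) {
        uint32_t seq;
        while ((seq = *flag) == 0)
            ;  /* busy-wait */
        return seq;
    }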

40
Application Synchronization Solutions
  • VM signalling
  • Page faults or access violations
  • Example: a page is only mapped locally when there
    is data to be read. If I access the page when
    there is no data, then a page fault occurs and I
    am blocked until the owner writes to the page

41
VM Signalling
  • If I wish to read and there is data to be read
    then the page is mapped into my address space
    read-only.
  • If I attempt to write to the page, a page fault
    occurs and I am blocked until I can acquire the
    write lock for the page (mimicked in the sketch
    below)
  • Not scalable, and too closely coupled to the VM
    system
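  • The mechanism can be mimicked in user space with
    mprotect: revoking access makes the next touch
    fault, which is where the kernel would block the
    reader until data arrives. A sketch of the idea
    only (calling mprotect from a signal handler is
    not strictly portable):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;

    /* In the real scheme the kernel blocks the faulting thread until
     * the owner writes; here the handler just restores access. */
    static void on_fault(int sig) {
        (void)sig;
        write(1, "fault: blocked until data ready\n", 32);
        mprotect(page, 4096, PROT_READ);  /* "data has arrived" */
    }

    int main(void) {
        page = mmap(NULL, 4096, PROT_NONE,  /* no data yet: no access */
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        signal(SIGSEGV, on_fault);
        char c = page[0];  /* faults, restarts once "data" exists */
        printf("read %d\n", c);
        return 0;
    }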

42
Out-of-Band Signalling
  • Use a separate channel outside the data transfer
    region to signal that data has been transferred.
  • For example, writing to a special set of
    addresses would cause an interrupt to be
    generated at the remote end

43
Out-of-Band Signalling
  • You would transfer the data by writing to your
    local addresses
  • You would then write to a special address
    associated with that memory region
  • An interrupt occurs on the other side; the OS
    works out which buffer you are referring to and
    wakes up the waiting process (sketched below)
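  • A C sketch of the send side, assuming the mapped
    card region lays out a data window followed by a
    doorbell register (the layout is illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout of the mapped card region: writes to
     * doorbell raise an interrupt on the remote machine. */
    struct oob_region {
        char              data[4096];  /* the transfer lands here */
        volatile uint32_t doorbell;    /* the special address     */
    };

    void send_oob(struct oob_region *r, const char *msg, size_t n) {
        memcpy(r->data, msg, n);    /* 1: the data transfer itself  */
        r->doorbell = (uint32_t)n;  /* 2: remote OS takes the       */
                                    /*    interrupt, finds the      */
                                    /*    buffer, wakes the reader  */
    }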

44
Out-of-Band Signalling
  • Out-of-band signalling still involves the
    processor to achieve application synchronization
  • This adds to the overall transfer latency
  • Example: Memory Channel
  • data transfer: 2.9 µs
  • acquiring a spin lock: 120 µs
  • It also increases the cost of the NIC

45
History of SMM
  • Used to be extremely proprietary
  • DEC Memory Channel is the best known
  • Used a fixed shared memory region of 512 MB
    divided into 64K pages, each page being 8 KB
  • Very versatile: pages can be shared between
    processes, and broadcast facilities can be used
  • Average latencies: 10-25 µs

46
SCI - Scalable Coherent Interface
  • IEEE Standard 1596-1992
  • Uses high-speed unidirectional links
  • Parallel links: 16 bits at 500 MHz (8 Gbit/s)
  • Serial: G-Link technology (1 Gbit/s)
  • Packet-based transfer
  • header: 16 bytes; data: 0, 16, 64 or 256 bytes
  • queue and signal interrupts

47
SCI cont'd
  • Can do cache coherency (optional)
  • Latency < 10 µs
  • Modern cards use 64-bit, 66 MHz buses (5.33
    Gbit/s)
  • Big player: Dolphin Interconnect
  • Sun uses their boards to build megaservers

48
Processor-Intensive Approach (PIA)
  • We offload networking by using a processor on the
    NIC
  • Myrinet - the best-known exponent
  • Full-duplex data links: 2 Gbit/s
  • Bus: 64-bit, 133 MHz PCI-X
  • On-card 255 MHz RISC processor with memory

49
Myrinet cont'd
  • Packet-based
  • Header, packet type, payload
  • Host computer controls the NIC
  • runs an MCP (Myrinet Control Program)
  • Myrinet controls around 39% of the cluster market

50
Performance
  • Latency: around 6.3 µs
  • Climbs to over 100 µs for messages over 10,000
    bytes
  • One-way throughput: 248 MB/s
  • for messages over 1000 bytes
  • Two-way throughput: 489 MB/s
  • for messages over 10,000 bytes
  • Throughput between Unix processes on different
    hosts
  • 1.98 Gbit/s (unidirectional), 3.9 Gbit/s
    (bidirectional)

51
Comparing SCI and Myrinet
  • Latencies are about the same
  • SCI is much faster for clusters of 8 nodes or
    fewer
  • but slows exponentially as the number of PCs
    increases
  • Myrinet is better for large systems (> 64 nodes)
  • Software appears more complete with Myrinet

52
Recent developments in Low Latency Systems
  • Collapsed LAN project (CLAN)
  • 1997-2002, AT&T Laboratories Cambridge
  • project originally centred around using fibre
    technology throughout the building
  • remoting PCs: just have a mouse, keyboard and
    display in your office and put the PC in the
    server room
  • bought some SCI cards and got some systems going

53
CLAN project
  • Faced the application synchronization problem
  • Came up with a novel solution called Tripwire
  • in-band synchronization
  • an event is signalled on the receiver when data
    is written to a special address in the data
    region during the data transfer (sketched after
    the diagram below)

54
Tripwire
[Diagram: a tripwire armed on an address in the data region shared between two processes]
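  • A conceptual C sketch of what the NIC does for
    each arriving write under this scheme (the real
    CLAN interface differs; the names here are
    illustrative):

    #include <stdint.h>

    /* A tripwire arms an address inside the data region; the NIC
     * compares incoming writes against it, so the synchronization
     * signal travels in-band with the data itself. */
    struct tripwire {
        volatile uint32_t *watched;  /* the armed address          */
        void (*fire)(void);          /* wakes the waiting process  */
    };

    void on_incoming_write(struct tripwire *tw,
                           volatile uint32_t *dest, uint32_t value) {
        *dest = value;            /* the data transfer proper      */
        if (dest == tw->watched)  /* address match -> synchronize  */
            tw->fire();
    }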
55
CLAN Project
  • Applications can therefore set Tripwires and be
    notified when they fire
  • no spinning, no extra hardware for out-of-band
    signalling
  • Latency
  • DWORD: RTT 3.7 µs
  • 1 KB IP transfer: 225 Mbit/s, RTT 100 µs
  • Throughput: 910 Mbit/s (33 MHz, 32-bit bus)

56
Will Low Latency Ever Make It into the Mainstream?
  • Some low latency 1 Gbit/s NICs are on the market
  • Unfortunately the 1 Gbit/s market is now in the
    commodity phase
  • The real battle is shaping up in the 10 Gbit/s
    market
  • CLAN project -> Level 5 Networks -> Solarflare