Transport Layer Enhancements for Unified Ethernet in Data Centers presentation

1
Transport Layer Enhancements for Unified Ethernet
in Data Centers

K. Kant
Raj Ramanujan
Intel Corp

Exploratory work only, not a committed Intel
position
2
Context

Data center is evolving
→ Fabric should too.
Last talk
Enhancements to Ethernet, already on track
This talk
Enhancements to Transport Layer
Exploratory, not in any standards track.

3
Outline

Data Center evolution & transport impact
Transport deficiencies & remedies
Many areas of deficiencies
Only Congestion Control and QoS addressed in
detail
Summary & Call to Action

4
Data Center Today
[Figure: tiered data center with separate network, IPC, and storage fabrics (SAN storage); traffic labels: client req/resp, business trans, database query]

Tiered structure
Multiple incompatible fabrics
Ethernet, Fiber Channel, IBA, Myrinet, etc.
Management complexity
Dedicated servers for applications → Inflexible
resource usage

5
Future DC Stage 1: Fabric Unification
[Figure: unified fabric with iSCSI storage; traffic labels: client req/resp, business trans, database query]

Enet dominant, but convergence really on IP.
New layer 2: PCI-Exp, Optical, WLAN, UWB, ...
Most ULPs run over transport over IP
→ Need to comprehend transport implications

6
Future DC Stage 2: Clustering & Virtualization

SMP → Cluster (cost, flexibility, ...)
Virtualization
Nodes, network, storage, ... → Virtual clusters
(VC)
Each VC may have multiple traffic types inside

[Figure: Virtual Clusters 1, 2, and 3 connected through an IP network]
7
Future DC: New Usage Models

Dynamically provisioned virtual clusters
Distributed storage (per node)
Streaming traffic (VoIP/IPTV data services)
HPC in DC
Data mining for focused advertising, pricing, ...
Special purpose nodes
Protocol accelerators (XML, authentication, etc.)
New models → New fabric requirements

8
Fabric Impact

More types of traffic, more demanding needs.
Protocol impact at all levels
Ethernet: Previous presentation.
IP: Change affects entire infrastructure.
Transport: This talk
Why transport focus?
Change primarily confined to endpoints.
Many app needs relate to transport layer
App. interface (Sockets/RDMA) mostly unchanged.
DC evolution → Transport evolution

9
Transport Issues & enhancements

Transport (TCP) enhancement areas
Better Congestion control and QoS
Support media evolution
Support for high availability
Many others
Message based unordered data delivery.
Connection migration in virtual clusters.
Transport layer multicasting.
How do we enhance transport?
New TCP compatible protocol?
Use an existing protocol (SCTP)?
Evolutionary changes to TCP from DC perspective.

10
What's wrong with TCP Congestion control?

TCP congestion control (CC) works independently
for each connection
By default TCP equalizes throughput → undesirable
Sophisticated QoS can change this, but ...
Lower level CC → Backpressure on transport
Transport layer congestion control is crucial

[Figure: congestion feedback from MAC level (switches) and ECN/ICMP (router) to transport-layer congestion control]
11
What's wrong with QoS?

Elaborate mechanisms
Intserv (RSVP), Diffserv, BW broker, ...
But a nightmare to use
App knowledge, many parameters, sensitivity, ...
What do we need?
Simple/intuitive parameters
e.g., streaming or not, normal vs. premium, etc.
Automatic estimation of BW needs.
Application focus, not flow focus!
QoS relevant primarily under congestion
→ Fix TCP congestion control, use IP QoS
sparingly.

12
TCP Congestion Control Enhancements

Collective control of all flows of an app
Applicable to both TCP & UDP
Ensures proportional fairness of multiple
inter-related flows
Tagging of connections to identify related flows.
Packet loss highly undesirable in DC
Move towards a delay based TCP variant.
Multilevel Coordination
Socket vs. RDMA apps, TCP vs. UDP, ...
A layer above transport for coordination
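
The slide does not say which delay-based TCP variant is meant. As a minimal sketch, assuming a TCP Vegas style rule, the window update below reacts to rising RTT instead of waiting for packet loss. The function name and the alpha and beta thresholds are illustrative choices, not values from the talk.

```python
def delay_based_cwnd(cwnd, base_rtt, rtt, alpha=2.0, beta=4.0):
    """Return the next congestion window (in packets) using delay only.

    base_rtt: smallest RTT seen so far, assumed to be queue-free (seconds).
    rtt:      latest RTT sample (seconds).
    alpha, beta: hypothetical bounds on packets queued in the network.
    """
    expected = cwnd / base_rtt                 # rate if nothing were queued
    actual = cwnd / rtt                        # rate actually achieved
    queued = (expected - actual) * base_rtt    # estimated packets sitting in queues
    if queued < alpha:
        return cwnd + 1                        # path looks idle: probe for more
    if queued > beta:
        return max(cwnd - 1, 1)                # queues building: back off before loss
    return cwnd                                # in the target band: hold steady
```

Because the sender backs off as soon as queueing delay grows, the window shrinks before buffers overflow, which is the property the slide asks for when it calls packet loss highly undesirable.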

13
Collective Congestion Control
Cong. Control

Control connections thru a congested device
together (control set)
Determining control set is challenging
BW requirement estimated automatically during
non-congested periods
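
One way to picture tagging and automatic bandwidth estimation is sketched below. It assumes each connection carries an application tag and that per-connection rates are sampled once per interval; the class name, the smoothing weight, and the sampling scheme are all assumptions made for illustration.

```python
from collections import defaultdict


class DemandEstimator:
    """Tracks a smoothed per-tag bandwidth demand (Mb/s)."""

    def __init__(self, weight=0.25):
        self.weight = weight      # hypothetical smoothing factor
        self.demand = {}          # tag -> smoothed demand in Mb/s

    def observe(self, samples, congested):
        """samples: list of (tag, mbps), one entry per connection."""
        if congested:
            return  # throttled rates do not reflect what the app needs
        per_tag = defaultdict(float)
        for tag, mbps in samples:
            per_tag[tag] += mbps  # related flows are summed under one tag
        for tag, total in per_tag.items():
            old = self.demand.get(tag, total)
            self.demand[tag] = (1 - self.weight) * old + self.weight * total
```

Samples taken while the path is congested are ignored, so the estimate reflects what an application uses when it is not being throttled.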

[Figure: example topology with nodes SW0, SW1, SW2, S11, S13, S21, S23, CL1, CL2]
14
Sample Collective Control

App 1: client1 → server1
Database queries over a single connection
→ Drives 5.0 Mb/s BW
App 2: client2 → server1
Similar to App 1
→ Drives 2.5 Mb/s BW
App 3: client3 → server2
FTP, starts at t = 30 secs
→ 25 conn. → 8 Mb/s

15
Sample Results
Cong. Control

Modified TCP can maintain 2:1 throughput ratio
Also yields lower losses & smaller RTT.

Collective control highly desirable within a DC
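
The arithmetic behind this result can be sketched with the demands from the sample scenario (App 1 at 5.0 Mb/s, App 2 at 2.5 Mb/s, FTP at 8 Mb/s over 25 connections). The slides give neither the bottleneck capacity nor the exact allocation rule, so the 10 Mb/s capacity and the demand-proportional scaling below are assumptions made purely for illustration.

```python
def max_min_per_connection(conn_demands, capacity):
    """Max-min fair rates when every connection is treated independently."""
    rates = [0.0] * len(conn_demands)
    remaining = capacity
    pending = sorted(range(len(conn_demands)), key=lambda i: conn_demands[i])
    while pending:
        share = remaining / len(pending)
        i = pending[0]
        if conn_demands[i] <= share:
            rates[i] = conn_demands[i]     # small demand is fully satisfied
            remaining -= conn_demands[i]
            pending.pop(0)
        else:
            for j in pending:              # everyone left gets the equal share
                rates[j] = share
            break
    return rates


def proportional_per_app(app_demands, capacity):
    """Scale every application's demand by the same factor to fit capacity."""
    total = sum(app_demands.values())
    scale = min(1.0, capacity / total)
    return {app: demand * scale for app, demand in app_demands.items()}


CAPACITY = 10.0                            # Mb/s, hypothetical bottleneck
conns = [5.0, 2.5] + [8.0 / 25] * 25       # App 1, App 2, then 25 FTP connections
independent = max_min_per_connection(conns, CAPACITY)
collective = proportional_per_app({"app1": 5.0, "app2": 2.5, "ftp": 8.0}, CAPACITY)
```

Under these assumptions, per-connection fairness lets the 25 FTP connections take their full 8 Mb/s and leaves App 1 and App 2 with about 1 Mb/s each, a 1:1 ratio. Scaling per application instead keeps App 1 at twice the rate of App 2.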
16
Adaptation to Media

Problem: TCP assumes loss implies congestion, and is
designed for WAN (high loss/delay)
Effects:
Wireless (e.g. UWB) attractive in DC (wiring
reduction, mobility, self configuration),
but TCP is not a suitable transport.
Overkill for communications within a DC.
Solution: A self-adjusting transport
Support multiple congestion/flow-control regimes.
Automatically selected during connection setup.
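
Selecting a regime at connection setup could look like the sketch below. The three regimes, the inputs (handshake RTT and a wireless flag), and the 1 ms cut-off are hypothetical; the talk only states that several regimes should exist and be chosen automatically.

```python
from enum import Enum


class Regime(Enum):
    DC_LOW_LATENCY = "delay-based control tuned for short, low-loss paths"
    WIRELESS = "loss is not treated as a congestion signal"
    WAN = "conventional loss-based TCP behaviour"


def select_regime(handshake_rtt_s, link_is_wireless):
    """Pick a congestion/flow-control regime once, during connection setup."""
    if link_is_wireless:
        return Regime.WIRELESS
    if handshake_rtt_s < 0.001:   # hypothetical cut-off for "inside the DC"
        return Regime.DC_LOW_LATENCY
    return Regime.WAN
```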

17
High Availability Issues

Problem: Single failure → broken connection, weak
robustness check, ...
Effect: Difficult to achieve high availability.

Solution:
Multi-homed connections w/ load sharing among
paths.
Ideally, controlled diversity path management
Difficult: need topology awareness, spanning tree
problem, ...
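
A minimal sketch of a multi-homed connection with load sharing follows. It assumes simple round-robin over live paths and an external failure detector that calls mark_failed; the class and method names are hypothetical, and the topology awareness that the slide calls difficult is left out.

```python
class MultiHomedConnection:
    """One logical connection spread over several network paths."""

    def __init__(self, paths):
        self._order = list(paths)
        self._alive = {p: True for p in self._order}
        self._next = 0

    def mark_failed(self, path):
        self._alive[path] = False

    def mark_recovered(self, path):
        self._alive[path] = True

    def pick_path(self):
        """Round-robin over live paths; fail only when every path is down."""
        for _ in range(len(self._order)):
            path = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._alive[path]:
                return path
        raise ConnectionError("no live path left for this connection")
```

With two paths, a single failure shifts all traffic onto the surviving path and the connection stays up.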

18
Summary & call to action

Data Centers are evolving
Transport must evolve too, but a difficult
proposition
TCP is heavily entrenched, change needs an
industry wide effort
Call to Action
Need to get an industry effort going to define
New features & their implementation
Deployment & compatibility issues.
Change will need push from data center
administrators & planners.

19
Additional Resources

Presentation can be downloaded from the IDF web
site; when prompted enter
Username: idf
Password: fall2005
Additional backup slides
Several relevant papers available at
http://kkant.ccwebhost.com/download.html
Analysis of collective bandwidth control.
SCTP performance in data centers.

20
Backup
21
Comparative Fabric Features
DC requirements
Feature                              TCP         SCTP        IBA
Scalability to 100 Gb/s              Difficult   Difficult   Easy?
Message based ULP support            No          Yes         Yes
QoS friendly transport?              No          No          Yes
Virtual channel support              No          No          Yes
DC centric flow/cong. control        No          No          Yes
Point to multipoint communication    No          No          Yes
High availability features           Poor        Fair        Good
Offload latency (end-pt only)        1 us        > 1 us      < 0.5 us
Compatible w/ TCP/IP base            Yes / limited (one entry missing)
Unordered data delivery              No          Yes         Yes
Protection against DoS attacks       Poor        Good        Poor
Multiple traffic streams             No          Yes         Yes
TCP lacks many desirable features; SCTP has some.
22
Transport Layer QoS

Needed at multiple levels
Between transport uses
Conn. of a given transport
Logical streams

[Figure: QoS levels: inter-app (Web app, DB app; may be on two VMs on same physical machine), intra-app (ntwk, IPC, page, iSCSI), intra-conn (text, images; cntrl, data)]

Best BW subdivision to maximize performance?

Requirements
Must be compatible with lower level QoS
PCI-Exp, MAC, etc.
Automatic estimation of bandwidth requirements
Automatic BW control

23
Multicasting in DC

Software/patch distribution
Multicast to all machines w/ same version.
Characteristics
Medium to large file transfer
Time to finish matters, BW doesn't.
Scale: 10s to 1000s.
High performance computing
MPI collectives need multicasting
Characteristics
Small but frequent transfers
Latency premium, BW not an issue mostly.
Scale: 10s to 100s

24
Transport layer multicasting
TL multicasting
DC needs                    IP multicasting                       TL multicasting
Legacy infrast.             Needs specialized routers             Std. routers adequate
Short msgs, dynamic group   Usually designed for long transfers   Appropriate mechanism?
Topology aware?             Yes (routing alg. based)              No (Need new mechanisms)
Low overhead                No (Complex mgmnt)                    Simpler, done in TL engine
Low latency                 Primarily BW focussed                 Need latency centric design
Reliable mcast.             Built on top                          Part of TL
25
TL multicasting value

Assumptions
A 16 node cluster w/ 4-node subclusters.
Mcast group: 2 nodes in each sub-cluster
Latencies
endpt 2 us, ack proc 1 us, switch 1 us
App-TL interface 5 us
Latency w/o mcast
send: 7x2 + 3x1 + 2 = 19 us
ack: 1 + 3x1 + 7x1 = 11 us
reply: 5 + 2 + 7x2 = 21 us
Total: 19 + 11 + 21 = 51 us
Latency w/ mcast
send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us
ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us
Total: 17 + 10 + 5 = 32 us
Larger savings in full network mcast.
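
The latency figures on this slide can be checked with a few lines of arithmetic. The terms are copied from the slide as written, and no mapping of individual terms to endpoint, switch, or ack-processing costs is assumed beyond what the slide states.

```python
# All values in microseconds, terms taken directly from the slide.
send_unicast = 7 * 2 + 3 * 1 + 2
ack_unicast = 1 + 3 * 1 + 7 * 1
reply_unicast = 5 + 2 + 7 * 2
total_unicast = send_unicast + ack_unicast + reply_unicast

send_mcast = 3 * 2 + 3 * 1 + 2 + 2 * (1 + 1) + 2
ack_mcast = 1 + 1 + 2 * 1 + 3 * 1 + 3 * 1
total_mcast = send_mcast + ack_mcast + 5   # the slide adds a 5 us term here

saving = total_unicast - total_mcast
print(total_unicast, total_mcast, saving)
```

The totals come to 51 us without multicast and 32 us with it, a saving of 19 us for this 16-node example.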

26
Hierarchical Connections

Choose a leader in each subnet.
Topology directed
Multicast connections to other nodes via leaders →
Ack consolidation at leaders (multicast)
Msg consolidation at leaders (reverse multicast)
Done by a layer above? (layer 4.5?)
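
Ack consolidation at a leader could work roughly as sketched below: the leader forwards one ack upstream only after every member of its subnet has acknowledged. The class, its method names, and the use of message ids are assumptions; the talk leaves open which layer would do this.

```python
class Leader:
    """Collects member acks in one subnet and reports a single consolidated ack."""

    def __init__(self, members):
        self.members = set(members)
        self.pending = {}             # msg_id -> members that have not acked yet

    def on_multicast(self, msg_id):
        """Called when the leader relays a multicast message to its subnet."""
        self.pending[msg_id] = set(self.members)

    def on_member_ack(self, msg_id, member):
        """Return True exactly once, when the last member has acked msg_id."""
        waiting = self.pending.get(msg_id)
        if waiting is None:
            return False              # unknown or already consolidated message
        waiting.discard(member)
        if not waiting:
            del self.pending[msg_id]
            return True               # caller sends one ack upstream now
        return False
```

The sender then sees one ack per subnet rather than one per node.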
