Transport Layer Enhancements for Unified Ethernet in Data Centers




1
Transport Layer Enhancements for Unified Ethernet
in Data Centers
  • K. Kant
  • Raj Ramanujan
  • Intel Corp

Exploratory work only, not a committed Intel
position
2
Context
  • Data center is evolving
  • ⇒ Fabric should too.
  • Last talk: enhancements to Ethernet, already on track.
  • This talk: enhancements to the transport layer.
  • Exploratory, not on any standards track.

3
Outline
  • Data center evolution & transport impact
  • Transport deficiencies & remedies
  • Many areas of deficiencies
  • Only congestion control and QoS addressed in
    detail
  • Summary & call to action

4
Data Center Today
[Figure labels: IPC fabric, storage fabric, network fabric, SAN storage; client req/resp, business trans, database query]
  • Tiered structure
  • Multiple incompatible fabrics
  • Ethernet, Fibre Channel, IBA, Myrinet, etc.
  • Management complexity
  • Dedicated servers for applications ⇒ inflexible
    resource usage

5
Future DC Stage 1: Fabric Unification
[Figure labels: iSCSI storage; client req/resp, business trans, database query]
  • Enet dominant, but convergence really on IP.
  • New layer 2: PCI-Exp, optical, WLAN, UWB, ...
  • Most ULPs run over transport over IP
  • ⇒ Need to comprehend transport implications

6
Future DC Stage 2: Clustering & Virtualization
  • SMP → cluster (cost, flexibility, ...)
  • Virtualization
  • Nodes, network, storage, ... ⇒ virtual clusters
    (VC)
  • Each VC may have multiple traffic types inside

[Figure labels: Virtual Cluster 1, Virtual Cluster 2, Virtual Cluster 3, IP network]
7
Future DC: New Usage Models
  • Dynamically provisioned virtual clusters
  • Distributed storage (per node)
  • Streaming traffic (VoIP/IPTV data services)
  • HPC in DC
  • Data mining for focused advertising, pricing, ...
  • Special purpose nodes
  • Protocol accelerators (XML, authentication, etc.)
  • New models ⇒ new fabric requirements

8
Fabric Impact
  • More types of traffic, more demanding needs.
  • Protocol impact at all levels
  • Ethernet: previous presentation.
  • IP: change affects entire infrastructure.
  • Transport: this talk.
  • Why transport focus?
  • Change primarily confined to endpoints.
  • Many app needs relate to transport layer
  • App. interface (Sockets/RDMA) mostly unchanged.
  • DC evolution ⇒ transport evolution

9
Transport Issues & Enhancements
  • Transport (TCP) enhancement areas
  • Better Congestion control and QoS
  • Support media evolution
  • Support for high availability
  • Many others
  • Message based unordered data delivery.
  • Connection migration in virtual clusters.
  • Transport layer multicasting.
  • How do we enhance transport?
  • New TCP compatible protocol?
  • Use an existing protocol (SCTP)?
  • Evolutionary changes to TCP from DC perspective.

10
What's wrong with TCP congestion control?
  • TCP congestion control (CC) works independently
    for each connection
  • ⇒ By default TCP equalizes throughput ⇒ undesirable
  • Sophisticated QoS can change this, but ...
  • Lower-level CC ⇒ backpressure on transport
  • Transport layer congestion control is crucial

[Figure labels: TL congestion control, congestion feedback (ECN/ICMP), MAC, switch, router]
11
What's wrong with QoS?
  • Elaborate mechanisms
  • IntServ (RSVP), DiffServ, BW broker, ...
  • But a nightmare to use
  • App knowledge, many parameters, sensitivity, ...
  • What do we need?
  • Simple/intuitive parameters
  • e.g., streaming or not, normal vs. premium, etc.
  • Automatic estimation of BW needs.
  • Application focus, not flow focus!
  • QoS relevant primarily under congestion
  • ⇒ Fix TCP congestion control, use IP QoS
    sparingly.

12
TCP Congestion Control Enhancements
  • Collective control of all flows of an app
  • Applicable to both TCP & UDP
  • Ensures proportional fairness of multiple
    inter-related flows
  • Tagging of connections to identify related flows.
  • Packet loss highly undesirable in DC
  • Move towards a delay based TCP variant.
  • Multilevel Coordination
  • Socket vs. RDMA apps, TCP vs. UDP, ...
  • A layer above transport for coordination

13
Collective Congestion Control
  • Control connections thru a congested device
    together (control set)
  • Determining control set is challenging
  • BW requirement estimated automatically during
    non-congested periods
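
One possible way to estimate bandwidth need automatically is a smoothed average of observed throughput that is updated only while the path is uncongested. This is a minimal sketch under that assumption; the class name, the EWMA gain, and the externally supplied congestion flag are all hypothetical and not taken from the talk.

```python
class DemandEstimator:
    """Tracks a flow group's bandwidth need as a smoothed average of its
    observed throughput, sampled only while the path is not congested."""

    def __init__(self, gain=0.25):
        self.gain = gain        # EWMA weight for new samples (assumed value)
        self.estimate = None    # Mb/s; None until the first sample arrives

    def observe(self, throughput_mbps, congested):
        # During congestion the observed rate reflects the bottleneck,
        # not the application's need, so the estimate is left unchanged.
        if congested:
            return self.estimate
        if self.estimate is None:
            self.estimate = throughput_mbps
        else:
            self.estimate += self.gain * (throughput_mbps - self.estimate)
        return self.estimate
```

The frozen estimate can then serve as the weight for each member of a control set once congestion sets in.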

[Figure: topology with nodes SW0, SW1, SW2, S11, S13, S21, S23, CL1, CL2]
14
Sample Collective Control
  • App 1: client1 → server1
  • Database queries over a single connection
  • ⇒ Drives 5.0 Mb/s BW
  • App 2: client2 → server1
  • Similar to App 1
  • ⇒ Drives 2.5 Mb/s BW
  • App 3: client3 → server2
  • FTP, starts at t = 30 secs
  • ⇒ 25 conn. ⇒ 8 Mb/s
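
To see why per-connection fairness hurts this scenario, the sketch below compares an idealized equal split per connection with a split proportional to each application's estimated demand. The connection counts and demands come from the slide; the 10 Mb/s bottleneck capacity, the assumption that all three apps share one bottleneck, and the function names are illustrative only.

```python
def per_connection_share(capacity, conns):
    """Equal split per connection, summed per app (idealized default TCP)."""
    total = sum(conns.values())
    return {app: capacity * n / total for app, n in conns.items()}


def collective_share(capacity, demand):
    """Split capacity across apps in proportion to their estimated demand."""
    total = sum(demand.values())
    if total <= capacity:
        return dict(demand)   # no congestion: every app gets what it asks for
    return {app: capacity * d / total for app, d in demand.items()}


conns = {"app1": 1, "app2": 1, "app3": 25}          # connections, from the slide
demand = {"app1": 5.0, "app2": 2.5, "app3": 8.0}    # Mb/s, from the slide
capacity = 10.0                                     # Mb/s, assumed bottleneck

fair = per_connection_share(capacity, conns)
coll = collective_share(capacity, demand)
```

With the per-connection split, App 1 and App 2 receive the same share and App 3's 25 connections take most of the link. With the proportional split, App 1 keeps twice the share of App 2 whatever the capacity, as long as the link is congested.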

15
Sample Results
  • Modified TCP can maintain the 2:1 throughput ratio
  • Also yields lower losses & smaller RTT.

Collective control highly desirable within a DC
16
Adaptation to Media
  • Problem: TCP assumes loss ⇒ congestion, and is
    designed for WAN (high loss/delay)
  • Effects:
  • Wireless (e.g. UWB) attractive in DC (wiring
    reduction, mobility, self configuration).
  • ... but TCP is not a suitable transport.
  • Overkill for communications within a DC.
  • Solution: a self-adjusting transport
  • Support multiple congestion/flow-control regimes.
  • Automatically selected during connection setup.
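
As a rough illustration of choosing a regime at connection setup, the sketch below picks one of three labels from a handshake-time probe. Everything here is hypothetical: the probe fields, the regime names, and the 500 us threshold are placeholders, and the talk does not say how the medium would be detected.

```python
from dataclasses import dataclass


@dataclass
class PathProbe:
    rtt_us: float        # round-trip time measured during the handshake
    lossy_medium: bool   # True if a wireless hop (e.g. UWB) is on the path


def select_regime(probe, dc_rtt_limit_us=500.0):
    """Pick a congestion/flow-control regime once, at connection setup."""
    if probe.lossy_medium:
        return "loss-tolerant"     # do not treat every loss as congestion
    if probe.rtt_us <= dc_rtt_limit_us:
        return "dc-delay-based"    # short, low-loss path inside the data center
    return "wan-loss-based"        # fall back to conventional TCP behaviour
```

For example, `select_regime(PathProbe(rtt_us=50.0, lossy_medium=False))` selects the data-center regime, while a path with a wireless hop is steered away from loss-as-congestion logic.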

17
High Availability Issues
  • Problem: single failure ⇒ broken connection, weak
    robustness check, ...
  • Effect: difficult to achieve high availability.
  • Solution:
  • Multi-homed connections w/ load sharing among
    paths.
  • Ideally, controlled diversity path management
  • Difficult: needs topology awareness, spanning tree
    problem, ...
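
A minimal sketch of a multi-homed connection with load sharing: messages rotate over the paths that are currently alive, so a single path failure reduces capacity instead of breaking the connection. The class and method names are assumptions; failure detection and topology-aware path choice are deliberately left out.

```python
class MultiHomedConnection:
    """Spreads messages round-robin over live paths and skips failed ones."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.alive = {p: True for p in self.paths}
        self._next = 0

    def mark_failed(self, path):
        self.alive[path] = False

    def mark_recovered(self, path):
        self.alive[path] = True

    def pick_path(self):
        # Try each path at most once, starting after the last one used.
        for _ in range(len(self.paths)):
            path = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if self.alive[path]:
                return path
        raise ConnectionError("no live path left")
```

The connection only fails when every path is down, which is the availability property the slide is after.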

18
Summary & Call to Action
  • Data Centers are evolving
  • Transport must evolve too, but a difficult
    proposition
  • TCP is heavily entrenched, change needs an
    industry wide effort
  • Call to action
  • Need to get an industry effort going to define:
  • New features & their implementation
  • Deployment & compatibility issues.
  • Change will need a push from data center
    administrators & planners.

19
Additional Resources
  • Presentation can be downloaded from the IDF web
    site; when prompted, enter:
  • Username: idf
  • Password: fall2005
  • Additional backup slides
  • Several relevant papers available at
    http://kkant.ccwebhost.com/download.html
  • Analysis of collective bandwidth control.
  • SCTP performance in data centers.

20
Backup
21
Comparative Fabric Features
DC requirement                      TCP         SCTP        IBA
Scalability to 100 Gb/s             Difficult   Difficult   Easy?
Message-based ULP support           No          Yes         Yes
QoS-friendly transport?             No          No          Yes
Virtual channel support             No          No          Yes
DC-centric flow/cong. control       No          No          Yes
Point-to-multipoint communication   No          No          Yes
High availability features          Poor        Fair        Good
Offload latency (end-pt only)       1 us        >1 us       <0.5 us
Compatible w/ TCP/IP base           Yes         Limited     (not given)
Unordered data delivery             No          Yes         Yes
Protection against DoS attacks      Poor        Good        Poor
Multiple traffic streams            No          Yes         Yes

TCP lacks many desirable features; SCTP has some.
22
Transport Layer QoS
  • Needed at multiple levels
  • Between transport uses
  • Conn. of a given transport
  • Logical streams

[Figure labels, grouped by QoS level]
Inter-app: Web app, DB app
  • May be on two VMs on the same physical machine.

Intra-app: ntwk, IPC, page, iSCSI
  • Best BW subdivision to maximize performance?

Intra-conn: text, images; cntrl, data
  • Requirements
  • Must be compatible with lower level QoS
  • PCI-Exp, MAC, etc.
  • Automatic estimation of bandwidth requirements
  • Automatic BW control

23
Multicasting in DC
  • Software/patch distribution
  • Multicast to all machines w/ same version.
  • Characteristics
  • Medium to large file transfer
  • Time to finish matters, BW doesn't.
  • Scale: 10s to 1000s.
  • High performance computing
  • MPI collectives need multicasting
  • Characteristics
  • Small but frequent transfers
  • Latency premium, BW not an issue mostly.
  • Scale: 10s to 100s.

24
Transport layer multicasting
DC needs                    IP multicasting                       TL multicasting
Legacy infrast.             Needs specialized routers             Std. routers adequate
Short msgs, dynamic group   Usually designed for long transfers   Appropriate mechanism?
Topology aware?             Yes (routing alg. based)              No (need new mechanisms)
Low overhead                No (complex mgmt)                     Simpler, done in TL engine
Low latency                 Primarily BW focused                  Need latency-centric design
Reliable mcast.             Built on top                          Part of TL
25
TL multicasting value
  • Assumptions
  • A 16-node cluster w/ 4-node subclusters.
  • Mcast group: 2 nodes in each sub-cluster
  • Latencies
  • endpt 2 us, ack proc 1 us, switch 1 us
  • App-TL interface 5 us
  • Latency w/o mcast
  • send: 7x2 + 3x1 + 2 = 19 us
  • ack: 1 + 3x1 + 7x1 = 11 us
  • reply: 5 + 2 + 7x2 = 21 us
  • Total: 19 + 11 + 21 = 51 us
  • Latency w/ mcast
  • send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us
  • ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us
  • Total: 17 + 10 + 5 = 32 us
  • Larger savings in full network mcast.
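
The latency budget on this slide can be checked mechanically. The snippet below re-adds the slide's itemized terms, read with the operators placed so that each line sums to its stated total, and derives the relative saving. The variable names are ours; the slide does not say which term belongs to which hop.

```python
# Without multicast: three legs, terms as itemized on the slide (microseconds).
send_uni = 7 * 2 + 3 * 1 + 2
ack_uni = 1 + 3 * 1 + 7 * 1
reply_uni = 5 + 2 + 7 * 2
total_uni = send_uni + ack_uni + reply_uni

# With multicast: two legs plus the 5 us term from the slide's total.
send_mc = 3 * 2 + 3 * 1 + 2 + 2 * (1 + 1) + 2
ack_mc = 1 + 1 + 2 * 1 + 3 * 1 + 3 * 1
total_mc = send_mc + ack_mc + 5

# Fraction of the unicast latency that multicast removes.
saving = 1 - total_mc / total_uni
print(f"{total_uni} us -> {total_mc} us ({saving:.0%} lower)")
```

The totals come to 51 us and 32 us, so transport-layer multicast removes a little over a third of the latency in this partial-group case.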

26
Hierarchical Connections
  • Choose a leader in each subnet.
  • Topology directed
  • Multicast connections to other nodes via leaders
    ⇒
  • Ack consolidation at leaders (multicast)
  • Msg consolidation at leaders (reverse multicast)
  • Done by a layer above? (layer 4.5?)
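
Ack consolidation at a subnet leader can be sketched as follows: the leader records acks from its members and signals a single upstream ack once all of them have responded. The class name, message identifiers, and the return-value convention are assumptions made for illustration; leader election, timeouts, and retransmission are omitted.

```python
class Leader:
    """Subnet leader that forwards one consolidated ack per message
    once every member of its subnet has acknowledged it."""

    def __init__(self, members):
        self.members = frozenset(members)
        self.pending = {}   # msg_id -> set of members that have acked so far

    def on_ack(self, msg_id, member):
        """Record an ack; return True when the consolidated ack should go upstream."""
        if member not in self.members:
            raise ValueError(f"{member!r} is not in this subnet")
        acked = self.pending.setdefault(msg_id, set())
        acked.add(member)               # duplicate acks are harmless
        if acked == self.members:
            del self.pending[msg_id]    # complete: forget the message
            return True
        return False
```

The sender then processes one ack per subnet rather than one per node, which is where the ack-path saving in a hierarchical scheme would come from.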