Title: Infrastructure and Protocols for Dedicated Bandwidth Channels
1Infrastructure and Protocols for Dedicated
Bandwidth Channels
Nagi Rao
Computer Science and Mathematics Division
Oak Ridge National Laboratory
raons_at_ornl.gov
March 14, 2005
1st Annual Workshop of Cyber Security and Information Infrastructure Research Group (CSIIR) and Information Operations Center (IOC)
Oak Ridge, TN
Research Sponsored by: Department of Energy, National Science Foundation, Defense Advanced Research Projects Agency
2Collaborators
- Steven Carter, Oak Ridge National Laboratory
- Leon O. Chua, University of California at Berkeley
- Jianbo Gao, University of Florida
- Qishi Wu, Oak Ridge National Laboratory
- William Wing, Oak Ridge National Laboratory
Sponsors
- Department of Energy: High-Performance Networking Program
- National Science Foundation: Advanced Network Infrastructure Program
- Defense Advanced Research Projects Agency: Network Modeling and Simulation Program
- Oak Ridge National Laboratory: Laboratory Directed R&D Program
3Outline of Presentation
- Network Infrastructure Projects
- DOE UltraScienceNet
- NSF CHEETAH
- Dynamics and Control of Transport Protocols
- TCP AIMD Dynamics
- Analytical Results
- Experimental Results
- New Class of Protocols
- Throughput Stabilization for Control
- Transport Protocol
- Probabilistic Quickest Path Problem
- Quickest path algorithm
- Probabilistic algorithm
5Motivation for Networking Projects: Terascale Supernova Initiative (TSI), a DOE large-scale science application
- Science Objective: Understand supernova evolutions
- DOE SciDAC Project: ORNL and 8 universities
- Teams of field experts across the country collaborate on computations
- Experts in hydrodynamics, fusion energy, high energy physics
- Massive computational code
- A terabyte is generated in a day currently
- Archived at nearby HPSS
- Visualized locally on clusters: only archival data
- Desired network capabilities
- Archive and supply massive amounts of data to supercomputers and visualization engines
- Monitor, visualize, collaborate and steer computations
Visualization channel
Visualization control channel
Steering channel
6DOE UltraScience Net
- The Need
- DOE large-scale science applications on supercomputers and experimental facilities require high-performance networking
- Petabyte data sets, collaborative visualization and computational steering
- Application areas span the disciplinary spectrum: high energy physics, climate, astrophysics, fusion energy, genomics, and others
- Promising Solution
- High bandwidth and agile network capable of providing on-demand dedicated channels: multiple 10s of Gbps down to 150 Mbps
- Protocols are simpler for high throughput and control channels
- Challenges: Several technologies need to be (fully) developed
- User-/application-driven agile control plane
- Dynamic scheduling and provisioning
- Security: encryption, authentication, authorization
- Protocols, middleware, and applications optimized for dedicated channels
Contacts: Bill Wing (wrw_at_ornl.gov), Nagi Rao (raons_at_ornl.gov)
7DOE UltraScience Net
- Connects ORNL, Chicago, Seattle and Sunnyvale
- Dynamically provisioned dedicated dual 10Gbps SONET links
- Proximity to several DOE locations: SNS, NLCF, FNL, ANL, NERSC
- Peering with ESnet, NSF CHEETAH and other networks
Data Plane User Connections
- Direct connections to core switches: SONET channels
- MSPP: Ethernet channels
- Utilize UltraScience Net hosts
Funded by U.S. DOE High-Performance Networking Program at Oak Ridge National Laboratory: $4.5M for 3 years
8Control-Plane
- Phase I
- Centralized VPN connectivity
- TL1-based communication with core switches and MSPPs
- User access via centralized web-based scheduler
- Phase II
- GMPLS direct enhancements and wrappers for TL1
- User access via GMPLS and web to bandwidth scheduler
- Inter-domain GMPLS-based interface
Bandwidth Scheduler
- Computes path with target bandwidth
- Is bandwidth available now?
- Extension of Dijkstra's algorithm
- Provide all available slots
- Extension of closed semiring structure to sequences of reals
- Both are polynomial-time algorithms
- GMPLS does not have this capability
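The "is bandwidth available now?" query can be illustrated as a Dijkstra-style search that ignores links whose currently available bandwidth is below the target. This is only a minimal sketch under stated assumptions, not the UltraScience Net scheduler: the function name `feasible_path`, the adjacency-list layout, and all link delays and bandwidths below are hypothetical, and only the site names come from the slides.

```python
import heapq

def feasible_path(links, src, dst, target_bw):
    """Least-delay path from src to dst using only links whose available
    bandwidth is at least target_bw (hypothetical helper for illustration).

    links: dict mapping node -> list of (neighbor, delay, available_bw).
    Returns (total_delay, [nodes on path]) or None if no such path exists.
    """
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            # Walk the predecessor links back to the source.
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, delay, bw in links.get(node, []):
            if bw < target_bw:
                continue  # link cannot carry the requested channel
            cand = dist + delay
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    return None

# Invented delays (ms) and available bandwidths (Gbps), for illustration only.
example_links = {
    "ORNL": [("Chicago", 10.0, 10.0), ("Sunnyvale", 40.0, 2.0)],
    "Chicago": [("Seattle", 30.0, 10.0), ("Sunnyvale", 35.0, 5.0)],
    "Seattle": [("Sunnyvale", 15.0, 10.0)],
}
```

Raising the target bandwidth prunes more links, so the returned path can only get longer or disappear. Listing all available time slots, as the scheduler does, would need per-link reservation calendars, which this sketch leaves out.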
Web-based User Interface and API
- Allows users to log on to website
- Request dedicated circuits
- Based on CGI scripts
9NSF CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture
- Objective
- Develop the infrastructure and networking technologies to support a broad class of eScience projects and specifically the Terascale Supernova Initiative.
- Main Technical Components
- Optical network testbed
- Transport protocols
- Middleware and applications
- Collaborative Project: $3.5M for 3 years
- U. Virginia, ORNL, NC State, CUNY
- Sponsor: National Science Foundation
Contacts: Malathi Veeraraghavan (mv_at_cs.virginia.edu), Nagi Rao (raons_at_ornl.gov)
10CHEETAH Project concept
- Network
- Create a network that offers on-demand end-to-end dedicated bandwidth channels to applications
- Operate a PARALLEL network to existing high-speed IP networks: NOT AN ALTERNATIVE!
- Transport protocols
- Design to take advantage of dedicated and dual end-to-end paths
- IP path and dedicated channel
- eScience Application Requirements
- High-throughput file/data transfers
- Interactive remote visualization
- Remote computational steering
- Multipoint collaborative computation
11CHEETAH Initial Configuration
Implements GMPLS protocols
12Peering UltraScience Net - CHEETAH
- Peering
- Coast-to-coast dedicated channels
- Access to ORNL supercomputers and storage
- Applications
- TSI on larger scale
13Outline of Presentation
- Network Infrastructure Projects
- DOE UltraScienceNet
- NSF CHEETAH
- Dynamics and Control of Transport Protocols
- TCP AIMD Dynamics
- Analytical Results
- Experimental Results
- New Class of Protocols
- Throughput Stabilization for Control
- Transport Protocol
- Probabilistic Quickest Path Problem
- Quickest path algorithm
- Probabilistic algorithm
14Transport Dynamics are Important
- Data Transport: High bandwidth for large data transfers over dedicated channels
- Maintain suitable sending rate to achieve effective throughput
- Control of end devices: Remote control of visualizations, computations and instruments
- Jittery dynamics will destabilize the control loops
- Will not be able to effectively execute interactive simulations
15Study of Transport Dynamics
- Understanding of transport dynamics
- Analytically showed that TCP-AIMD contains chaotic regimes
- Concept of w-update map
- Internet traces are shown to be both chaotic and stochastic
- Underlying process is anomalous diffusion
- Development and tuning of protocols
- Protocols for stable flows of fixed rate: ONTCOU
- Based on classical Robbins-Monro method
- Transport protocols with statistical stability: RUNAT
- Combination of AIAD and Kiefer-Wolfowitz method
16Complicated TCP AIMD Dynamics - History
- Simulation Results: TCP-AIMD exhibits complicated trajectories
- TCP streams competing with each other (Veres and Boda 2000)
- TCP competing with UDP (Rao and Chua 2002)
- Analytical Results (Rao and Chua 2002): TCP-AIMD has chaotic regimes
- Developed state space analysis and Poincaré maps
- Internet Measurements (2004): TCP-AIMD traces are a complicated mixture of stochastic and chaotic components
- Working Definition of Chaotic Trajectories
- Nearby starting points will result in trajectories that move far apart
- at a rate determined by the Lyapunov exponent (> 0)
- Trajectories are non-periodic for some starting points
- The attractor is geometrically complicated
17Simplified View: Dynamics of TCP
Figure (window size vs. time): early loss slows throughput; slow start; congestion control (1/w)
- Transmission Control Protocol Outline
- Uses window mechanism to send W bytes/RTT
- Dynamically adjusts W to network and receiver state
- Keeps increasing if no losses
- Keeps shrinking if losses are detected
- Slow start phase
- W increases exponentially until a threshold is reached or a loss occurs
- Congestion Control
- Additively increase W with delivered packets
- Multiplicatively decrease with loss
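The window behavior outlined above can be sketched as an idealized per-RTT trace. This is a toy model under stated assumptions, not an implementation of any real TCP stack: a loss is assumed whenever the window exceeds a fixed `capacity`, the window halves on loss, and the names `tcp_window_trace`, `capacity` and `ssthresh` are chosen for illustration.

```python
def tcp_window_trace(rounds, capacity, ssthresh=32.0, w0=1.0):
    """Idealized per-RTT window trace: slow start followed by AIMD.

    Assumption: a loss occurs exactly when the window exceeds `capacity`
    (a stand-in for the path's bandwidth-delay product plus buffering).
    """
    w = w0
    trace = []
    for _ in range(rounds):
        trace.append(w)
        if w > capacity:
            # Multiplicative decrease on loss; remember the new threshold.
            w = max(1.0, w / 2.0)
            ssthresh = w
        elif w < ssthresh:
            # Slow start: double each round, capped at the threshold.
            w = min(2.0 * w, ssthresh)
        else:
            # Congestion control: additive increase of one unit per round.
            w += 1.0
    return trace

example_trace = tcp_window_trace(rounds=40, capacity=40.0)
```

With a fixed capacity the trace settles into the textbook saw-tooth. The measurements later in the talk show that real traces look quite different, because losses and delays are not this regular.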
18Chaotic Dynamics of TCP
- Competing TCP streams: Window dynamics are chaotic
- Hard to predict: resemblance to random noise
- Hard to conclude from experiments: nearby orbits move far away later
- Hard to characterize chaotic attractor
- Poincaré map of two window sizes
- Two-streams case
- Four-streams case
- Veres and Boda (2000) did not rigorously establish chaos in a formal sense
- Attractor could have been generated by a periodic orbit with large period
- We repeated the simulation and found only quasi-periodic trajectories
19Noisy Nature of TCP (simulation)
Figure: TCP source; router with uniform random drops; destination
- Simple random traffic generates complicated attractors
- TCP reacts to network traffic randomness
- Jittery end-to-end delays
- Do not need chaos to generate complicated attractors
- Poincaré map of message delay vs. window size
20TCP Competing with UDP (ns-2 simulation)
- As CBR rate is varied
- TCP competing with UDP/CBR at the router
generates a variety of dynamics
Figure (topology): TCP/Reno source, UDP/CBR source, router, sink; link labels 2Mb, 10ms, DT; 1.7Mb, 10ms, DT; 2Mb, 10ms, DT
Figure (plots, UDP/CBR at 1 Mb/s): window size W(t) vs. time; Poincaré phase plot of window size W(t) vs. packet end-to-end delay D(t)
21TCP Competing with UDP
Figure panels: UDP/CBR at 1.0 Mb/s, 1.75 Mb/s, 1.7 Mb/s, 0.5 Mb/s, 1.7 Mb/s
22Summary of Our Analytical Results
- State-Space of TCP
- congestion window; packet delay including re-transmits
- acknowledgements since last MD; losses inferred since last AI
- TCP-AIMD dynamics have two qualitatively different regimes
- Regime one: highlighted in usual TCP literature
- increased with while
- Regime two: highlighted by and
- decreases with
- Its effect and duration is enhanced by network delay and high buffer occupancy
- Trajectories move back and forth between these two regimes
- We define a Poincaré map that updates w: the w-update map M
- M is 1-dimensional if Regime Two is short-lived
- M is 2-dimensional and complicated if Regime Two is significant
- M is qualitatively similar to the tent map: generates chaotic trajectories
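The classical tent map itself shows why a map of this shape produces chaotic trajectories: two starting points that differ by a tiny amount separate by a factor of about 2 per iteration. The sketch below uses the plain tent map with slope 2, not the w-update map M; the function names and the starting values are assumptions chosen for illustration.

```python
import math

def tent(x, mu=2.0):
    """Classical tent map on [0, 1] with slope magnitude mu."""
    return mu * x if x < 0.5 else mu * (1.0 - x)

def orbit(x0, steps, mu=2.0):
    """Trajectory x0, f(x0), f(f(x0)), ... with steps iterations."""
    xs = [x0]
    for _ in range(steps):
        xs.append(tent(xs[-1], mu))
    return xs

def separation(x0, eps, steps, mu=2.0):
    """Distance between the orbits started at x0 and x0 + eps."""
    a = orbit(x0, steps, mu)
    b = orbit(x0 + eps, steps, mu)
    return [abs(p - q) for p, q in zip(a, b)]

def growth_rate(seps):
    """Average exponential growth rate per step of a separation sequence."""
    return math.log(seps[-1] / seps[0]) / (len(seps) - 1)

# Two orbits that start 1e-9 apart, followed for 20 iterations.
example_seps = separation(0.2, 1e-9, 20)
example_rate = growth_rate(example_seps)  # close to ln 2, the Lyapunov exponent
```

Only short runs are meaningful here: with slope exactly 2, binary floating point eventually collapses every orbit to 0, which is a numerical artifact rather than a property of the map.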
23Dynamics of Transitions Between Regimes
- map for long TCP transfers
Figure: trajectories of w vs. t in Regime 1 and Regime 2
Both regimes are unstable (eigenvalue analysis)
24M: w-update map
- Given value, gives its next updated values
- after some time period (not fixed)
- Regime 1
- Regime 2
- depends on the number of dropped packets
- buffer occupancy at that time
- delay between source and bottleneck buffer
- Result: M is parametrized, and each piece resembles a twisted version of the classical tent map
Rao, Gao and Chua, chapter in Complex Dynamics in Communications Networks, 2004
25Internet Measurements: Joint work with Jianbo Gao
- Question 1: How relevant are previous simulation and analytical results on chaotic trajectories?
- Answer: Relevant from an analysis perspective to a certain extent.
- Question 2: Do actual Internet TCP measurements exhibit chaotic behavior?
- Answer: Yes. They are more complicated than chaotic (deterministic).
26Internet Measurements
- Internet (net100) traces show that TCP-AIMD dynamics are a complicated mixture of chaotic and stochastic regimes
- Chaotic: TCP-AIMD dynamics
- Stochastic: TCP response to network traffic
- Basic Point: TCP traces collected on all Internet connections showed complicated dynamics
- Classical saw-tooth profile is not seen even once
- This is not a criticism of TCP; it was not intended for smooth dynamics
27Cwnd time series for ORNL-LSU connection
Connection: OC192 to Atlanta-Sox; Internet2 to Houston; LAnet to LSU
Time series cwnd x(t): collected at approximately 1 ms resolution using net100 instruments
28Time-Dependent Exponent Plots
Informally, a measure of how separated close-by states become in time. Exponential separation is characteristic of chaotic regime.
Form state vectors of size m from the sampled time series x(t), denoted by x(1), x(2), ...
For two state vectors satisfying
we define time-dependent exponent as
Figure panels: uniform random: spread out; Lorenz chaotic: common envelope
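The slide's own formulas are not reproduced here. As a hedged illustration, the sketch below assumes one common formulation of the time-dependent exponent: delay-embed the series into vectors of size m, select pairs of vectors whose initial distance lies in a shell [r_lo, r_hi], and average the logarithm of the ratio of their distance k steps later to their initial distance. The function names, the shell parameters, and the omission of any rule excluding temporally close pairs are assumptions.

```python
import math

def embed(x, m, lag=1):
    """Delay-embed the series x into state vectors of size m."""
    count = len(x) - (m - 1) * lag
    return [tuple(x[i + j * lag] for j in range(m)) for i in range(count)]

def distance(u, v):
    """Euclidean distance between two state vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def time_dependent_exponent(x, m, k, r_lo, r_hi, lag=1):
    """Average of ln(d_k / d_0) over vector pairs with r_lo <= d_0 <= r_hi,
    where d_0 is the initial distance and d_k the distance k steps later.
    Returns None if no pair falls in the shell."""
    vectors = embed(x, m, lag)
    usable = len(vectors) - k
    total, pairs = 0.0, 0
    for i in range(usable):
        for j in range(i + 1, usable):
            d0 = distance(vectors[i], vectors[j])
            if d0 <= 0.0 or not (r_lo <= d0 <= r_hi):
                continue
            dk = distance(vectors[i + k], vectors[j + k])
            if dk > 0.0:
                total += math.log(dk / d0)
                pairs += 1
    return total / pairs if pairs else None

# Synthetic series in which every distance doubles per step, so the
# exponent at evolution time k should be k * ln 2 for any shell.
doubling_series = [2.0 ** t for t in range(10)]
example_exponent = time_dependent_exponent(doubling_series, m=1, k=2,
                                           r_lo=0.5, r_hi=1000.0)
```

Plotting this quantity against k for several shells is what produces the "common envelope" (chaotic) versus "spread out" (stochastic) pictures mentioned on the slide.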
29Internet cwnd measurements: Both Stochastic and Chaotic Parts are Dominant
- TCP traces have
- Common envelope: chaotic
- Spread out: stochastic
- at certain scales
- Observations
- From analysis, chaotic dynamics are from AIMD
- Stochastic component is in response to network traffic: losses and RTT variations
Gao and Rao, IEEE Comm Letters, 2005, in press
30Design of Transport Protocols with Smooth Dynamics
- Observation 1: Avoid AIMD-like behavior to avoid chaotic dynamics
- Challenge: Randomness is inherent in Internet connections; it will not go away even if the protocol is non-chaotic.
- Our Solution: Explicitly account for randomness in the protocol design using stochastic approximation
31Throughput Stabilization
- Niche Application Requirement: Provide stable throughput at a target rate, typically much below peak bandwidth
- High-priority channels
- Commands for computational steering and visualization
- Control loops for remote instrumentation
- TCP AIMD is not suited for stable throughput
- Complicated dynamics
- Underflows with sustained traffic
- Important Consideration
- Stochasticity of Internet connections must be explicitly accounted for
Rao, Wu and Iyengar, IEEE Comm Letters, 2004
32Stochastic Approximation: UDP window-based method
Objective: adjust source rate to achieve (almost) fixed goodput at the destination application
Difficulty: data packets and acks are subject to random processes
Approach: Rely on statistical properties of data paths
33UDP-Based Framework
Send datagrams and wait for period
Source: sending rate. Destination: goodput, loss rate
Goodput regression
Loss regression
34Channel Throughput profile
- Plot of receiving rate as a function of sending rate
- Its precise interpretation depends on
- Sending and receiving mechanisms
- Definition of rates
- For protocol optimizations, it is important to use its own sending mechanism to generate the profile
- Window-based sending process for UDP datagrams
- Send datagrams in one step: window size
- Wait for time called idle-time or wait-time
- Sending rate at time resolution
- This is an ad hoc mechanism facilitated by 1GigE NIC
35Throughput Profile: Throughput and loss rates vs. sending rate (window size, cycle time)
Figure panels: typical day; Christmas day. Marked regions: peak zone; stabilization zone
Objective: adjust source rate to yield the desired throughput at destination
36Adaptation of source rate
- Sending process: send datagrams and wait for duration
- Adjust the window size
- Adjust cycle-time
- Both are special cases of classical Robbins-Monro method
target throughput
noisy estimate
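A Robbins-Monro style update of the kind named above can be sketched as follows: the control variable (window size or, equivalently here, source rate) is moved against the difference between a noisy goodput estimate and the target, with a step size that shrinks over time. This is a minimal sketch under stated assumptions, not the ONTCOU implementation: the step-size form a / n**alpha, the default values of `a` and `alpha`, the function names, and the synthetic channel with bounded measurement noise are all hypothetical.

```python
import random

def robbins_monro_rate(measure_goodput, target, w0, steps, a=0.8, alpha=0.8):
    """Iterate w_{n+1} = w_n - (a / n**alpha) * (g_n - target), where g_n is
    a noisy goodput measurement taken at the current setting w_n."""
    w = w0
    history = [w]
    for n in range(1, steps + 1):
        g = measure_goodput(w)
        w = max(0.0, w - (a / n ** alpha) * (g - target))
        history.append(w)
    return history

def make_noisy_channel(capacity, noise, seed=0):
    """Synthetic channel: goodput equals the sending rate up to capacity,
    plus bounded uniform measurement noise (an assumption for illustration)."""
    rng = random.Random(seed)
    def measure(rate):
        return min(rate, capacity) + rng.uniform(-noise, noise)
    return measure

# Stabilize at 2.0 units on a channel of capacity 10.0, starting from 5.0.
example_channel = make_noisy_channel(capacity=10.0, noise=0.1, seed=1)
example_history = robbins_monro_rate(example_channel, target=2.0, w0=5.0, steps=500)
```

The decreasing step size is what trades responsiveness for stability: early steps move quickly toward the target, while later steps average out the measurement noise instead of chasing it.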
37Performance Guarantees
- Summary
- Stabilization is achieved with a high probability with a very simple estimation of source rate
- Basic result for the general update
- We have
38Internet Measurements
- ORNL-LSU connection (before recent upgrade)
- Hosts with 10 M NIC
- 2000 mile network distance
- ORNL-NYC ESnet
- NYC-DC-Hou Abilene
- HOU-LSU Local n/s
- ORNL-GaTech Connection
- Hosts with GigE NICS
- ORNL-Juniper router 1Gig link
- Juniper- ATL Sox OC192 (1Gig link)
- Sox-GaTech 1Gig link
39ORNL-LSU Connection
40Goodput Stabilization ORNL-LSU: Experimental Results
- Case 1: Target goodput 1.0 Mbps, rate control through congestion window, a = 0.8
- Case 2: Target goodput 2.0 Mbps, rate control through congestion window, a = 0.8
Figure axes (both cases): datagram acknowledging time vs. source rate (Mbps) and goodput (Mbps)
41Throughput Stabilization ORNL-GaTech
- Target goodput 20.0 Mbps, a = 0.8, adjust congestion window size
- Target goodput level 2.0 Mbps, a = 0.8, adjust sleep time
42RUNAT: Reliable UDP-based Network Adaptive Transport
- Transport protocol
- Maximize connection utilization: Track peak goodput
- Uses Kiefer-Wolfowitz stochastic approximation to handle ACKs and losses
- Features
- Tailored to random loss rate and RTT
- Segmented rate control
- 3 control zones: bottleneck link is underutilized, saturated, and overloaded
- Explicit accounting for random components
- Use stochastic approximation methods based on goodput estimates
- TCP-friendliness
- Rate-increasing and rate-decreasing coefficients are dynamically adjusted
- Adaptable to diverse network environments
- Measurements and control periods are not constant, but link-specific (use RTT).
Wu and Rao, INFOCOM 2005
43Three Zones of Goodput Profile
- Three control zones
- Zone I: Adaptive Increase
- Bottleneck link is underutilized
- Low packet loss due to occasional congestion or transmission errors
- Fixed with increasing source rate
- Zone II (transitional): dynamic KWSA method
- Bottleneck link is saturated
- Peak goodput falls within this zone
- SA determines whether to increase or decrease source rate
- Zone III: Adaptive Decrease
- Bottleneck link is overloaded
- Large packet loss due to network congestion
- Back off to recover from congestion collapse
Figure (goodput regression vs. sending rate r): Zone I zero loss; Zone II low loss; Zone III high loss; stabilize sending rate at the peak
44Segmented Rate Control Algorithm
Loss rate estimate
Basic Idea: Control sending rate based on loss rate estimate to achieve peak goodput
Rate update: three cases, selected by the loss rate estimate
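The three-zone idea can be sketched as a controller that increases the rate additively while the loss estimate is low, backs off in proportion to the loss when it is high, and takes Kiefer-Wolfowitz finite-difference steps on the goodput in between. This is a hedged illustration only, not the RUNAT update rules: the loss thresholds, gains, step-size schedules and the synthetic noise-free channel model (`loss_rate`, `goodput`) are all assumptions.

```python
def loss_rate(r):
    """Synthetic loss model: no loss up to rate 80, quadratic growth beyond."""
    u = max(0.0, (r - 80.0) / 80.0)
    return min(1.0, u * u)

def goodput(r):
    """Goodput under the synthetic loss model; it peaks at r = 320/3."""
    return r * (1.0 - loss_rate(r))

def three_zone_control(r0, steps, lo=0.02, hi=0.25, inc=5.0, a=20.0, c=2.0):
    """Segmented rate control driven by the loss estimate at the current rate."""
    r, n = r0, 0
    trace = [r]
    for _ in range(steps):
        p = loss_rate(r)
        if p < lo:
            # Zone I: link underutilized, additive increase.
            r += inc
        elif p > hi:
            # Zone III: link overloaded, back off in proportion to the loss.
            r *= (1.0 - p)
        else:
            # Zone II: Kiefer-Wolfowitz step along a finite-difference
            # estimate of the goodput slope, with shrinking gains.
            n += 1
            cn = c / n ** (1.0 / 3.0)
            slope = (goodput(r + cn) - goodput(r - cn)) / (2.0 * cn)
            r += (a / n) * slope
        trace.append(r)
    return trace

trace_from_low = three_zone_control(10.0, 300)
trace_from_high = three_zone_control(150.0, 300)
```

Starting from either side, the rate first reaches the middle zone and then settles near the goodput peak of this model, which mirrors the informal convergence statement on the next slide. A real protocol would estimate loss and goodput from acknowledgements over an RTT-scaled period rather than query a model.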
45Convergence Properties of RUNAT
Informal Statement: If in zones I or III, it will exit to zone II. If in zone II, it will converge to maximum throughput.
Condition A1: loss statistics vary slowly
Condition A2: loss regression is differentiable and its derivative is monotonically increasing with respect to r in Phase II.
Result: RUNAT in zone I or III enters zone II in a finite number of steps almost surely. In zone II, RUNAT will almost surely converge to the peak goodput.
46Experimental Results on link between ozy4 (ORNL) and robot (LSU): Illustration of microscopic RUNAT behaviors during transfer of 20MB data
- The decrement of source rate upon packet loss is determined by congestion levels (local loss rate measurements) and the rate-decreasing coefficient; higher congestion levels result in larger rate drops.
- The increment of source rate is determined by congestion levels (local loss rate measurements) and the rate-increasing coefficient.
- When far away from the saturation (peak) point, the coefficient is adjusted to large values to quickly move towards the peak point.
- When approaching the saturation (peak) point, the coefficient is adjusted to small values to slowly converge to and remain at the peak point.
Figure annotations: Slow Start; Zone I (loss rate 0); Zone II (loss rate 3.33); Zone III (loss rate 37.33)
47Experimental Results on link between ozy4 (ORNL) and robot (LSU): RUNAT transport performance during transfer of 2GB data with concurrent TCP transfer of 50MB data
Case 1: run RUNAT and TCP concurrently
- RUNAT throughput: 10.49Mbps
- TCP throughput: 0.376Mbps
Case 2: run a single TCP only
- Single TCP throughput: 0.377Mbps
Note: The low throughputs were due to the high traffic volume at the time of experiments. On a normal day with regular traffic volume, TCP is able to achieve 36Mbps and RUNAT may reach 1530Mbps at lower loss rates without significantly affecting concurrent TCP on this link.
48Experimental Results on link from ozy4 (ORNL) to orbitty (NC State)
49ORNL-Atlanta-ORNL 1Gbps Channel
Figure labels: Juniper M160 Router at ORNL (GigE blade, SONET blade); Juniper M160 Router at Atlanta (SONET blade, IP loop); OC192 ORNL-ATL; hosts on GigE: Dell Dual Xeon 3.2GHz, Dual Opteron 2.2 GHz
- Host to Router
- Dedicated 1GigE NIC
- ORNL Router
- Filter-based forwarding to override both at input and middle queues and disable other traffic to GigE interfaces
- IP packets on both GigE interfaces are forwarded to out-going SONET port
- Atlanta-SOX router
- Default IP loopback
- Only 1Gbps on OC192 link is used for production traffic: 9Gbps spare capacity
501Gbps Dedicated IP Channel
Figure labels: Juniper M160 Router at ORNL (GigE blade, SONET blade); Juniper M160 Router at Atlanta (SONET blade, IP loopback); OC192 ORNL-ATL; hosts on GigE: Dell Dual Xeon 3.2GHz, Dual Opteron 2.2 GHz
- Non-Uniform Physical Channel
- GigE SONET GigE
- 500 network miles
- End-to-End IP Path
- Both GigE links are dedicated to the channel
- Other host traffic is handled through second NIC
- Routers, OC192 and hosts are lightly loaded
- IP-based Applications and Protocols are readily
executed
51Dedicated Hosts
- Hosts
- Linux 2.4 kernel (Redhat, Suse)
- Two NICs
- optical connection to Juniper M160 router
- copper connection to Ethernet switch/router
- Disks: RAID 0 dual disks (140GB SCSI)
- XFS file system
- Peak disk data rate is 1.2Gbps (IOzone measurements)
- Disk is not a bottleneck for 1Gbps data rates
52UDP goodput and loss profile
High goodput is received at non-trivial loss
Goodput plateau: 990Mbps
Non-zero and random loss rate
Point in horizontal plane
531GigE NICS Act as Rate Controllers
Figure labels: Host (application buffer, kernel buffer, GigE NIC) connected to Juniper M160; data rates could exceed 1Gbps inside the host; rate limited to 1Gbps at the NIC and at the link
- Our window-based method
- Flow rate from application to NIC is ON/OFF and exceeds 1Gbps at times
- Flow is regulated to 1Gbps: NIC rate matches the link rates
- This method does not work well if NIC rate is higher than link rate or router port rate
- NIC may send at higher rate causing losses at router port
54Best Performance of Existing Protocols
Disk-to-Disk Transfers (unet2 to unet1)
Memory-to-Memory Transfers
UDT: 958Mbps
Both Iperf and throughput profiles indicated 990 Mbps levels. Potentially such rates are achievable if disk access and protocol parameters are tuned.
55Hurricane Protocol
- Composed based on principles and experiences with UDT and SABUL
- It was not easy for us to figure out all tweaks for pushing peak performance
- UDP window-based flow-control
- Nothing fundamentally new, but needed for fine tuning
- 990 Mbps on dedicated 1Gbps connection, disk-to-disk
- No attempt at congestion control
56Hurricane Control Structure
Figure labels: Sender: disk, send datagrams, reload lost datagrams. Receiver: receiver buffer, reordering datagrams, disk. TCP; group k NACKs
Different subtasks are handled by threads, which are woken up on demand. Thread invocations are reduced by clustered NACKs instead of individual ACKs.
57Hurricane
58Ad hoc Optimizations
- Manual tuning of parameters
- Wait-time parameter
- Initial value chosen from throughput profile
- Empirically, goodput is unimodal in the wait-time: pairwise measurements for binary search
- Group size k for NACKs
- Empirically, goodput is unimodal in k and is tuned
- Disk-specific details
- Reads done in batch: no input buffer
- NACKs are handled using fseek, attached to the next batch
- This tuning is not likely to be transferable to other configurations and different host loads
- More work needed: automatic tuning and systematic analysis
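Tuning a parameter in which goodput is unimodal can be automated by repeatedly comparing a pair of measurements and discarding the part of the interval that cannot contain the peak. The sketch below uses ternary search as a stand-in for the pairwise-measurement search mentioned above; it is not the procedure used for Hurricane. The function name, the tolerance, and the synthetic goodput curve with its peak at a wait-time of 120 are assumptions.

```python
def tune_unimodal(measure, lo, hi, tol=1e-2):
    """Shrink [lo, hi] around the maximizer of a unimodal function by
    comparing two interior measurements per iteration (ternary search)."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if measure(m1) < measure(m2):
            lo = m1   # the peak cannot lie to the left of m1
        else:
            hi = m2   # the peak cannot lie to the right of m2
    return (lo + hi) / 2.0

def synthetic_goodput(wait_time):
    """Made-up unimodal goodput curve (Mbps) peaking at wait_time = 120."""
    return 990.0 - 4e-4 * (wait_time - 120.0) ** 2

best_wait = tune_unimodal(synthetic_goodput, 0.0, 500.0)
```

With real measurements each call to `measure` is noisy, so in practice each comparison would need repeated or averaged runs, which is part of why the slide calls for more systematic analysis.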
59Outline of Presentation
- Network Infrastructure Projects
- DOE UltraScienceNet
- NSF CHEETAH
- Dynamics and Control of Transport Protocols
- TCP AIMD Dynamics
- Analytical Results
- Experimental Results
- New Class of Protocols
- Throughput Stabilization for Control
- Transport Protocol
- Probabilistic Quickest Path Problem
- Quickest path algorithm
- Probabilistic algorithm
60Shortest Path Problem
Classical Problem: Given a graph G = (V, E) along with a distance (delay) function on edges. For a path P, we define the path distance (delay) as the sum of the distances of its edges. Compute a path with smallest path distance from source node s to destination node d. Solved using Dijkstra's Algorithm in polynomial time.
61Quickest Path Problem
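Problem: Given a graph G = (V, E) along with
1. delay function on edges
2. bandwidth function on edges
For a path P and a message of size σ, we define the total delay T(σ) as the sum of the edge delays on P plus σ divided by the smallest edge bandwidth on P. Compute a path with smallest total delay from source node s to destination node d. Solved using Chen and Chin's Algorithm in polynomial time.
Important Observation: Subpath of a quickest path is not necessarily quickest
Figure (network from s to d, edges labeled bandwidth, delay): edge labels 5,20; 5,20; 15,5; 15,20; path totals T(60)=32, T(60)=52, T(60)=57, T(60)=29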
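The total delay of a path can be checked numerically against the figure values. The sketch below assumes each edge label reads (bandwidth, delay) and that T(σ) is the sum of edge delays plus σ divided by the bottleneck bandwidth; both are assumptions, chosen because they reproduce all four T(60) values. The grouping of edges into paths is likewise inferred, and `total_delay` is a hypothetical helper name.

```python
def total_delay(path_edges, sigma):
    """End-to-end delay of a message of size sigma over a path.

    path_edges: list of (bandwidth, delay) pairs, one per edge.
    The message pays every edge delay once, plus sigma divided by the
    smallest bandwidth on the path (the bottleneck).
    """
    delay = sum(d for _, d in path_edges)
    bottleneck = min(b for b, _ in path_edges)
    return delay + sigma / bottleneck

# Edge combinations inferred from the figure labels.
figure_values = {
    "one (5,20) edge": total_delay([(5, 20)], 60),
    "two (5,20) edges": total_delay([(5, 20), (5, 20)], 60),
    "two (5,20) edges then (15,5)": total_delay([(5, 20), (5, 20), (15, 5)], 60),
    "(15,5) then (15,20)": total_delay([(15, 5), (15, 20)], 60),
}
```

Because the bottleneck term depends on the whole path, extending a path can change which prefix is best, which is why a subpath of a quickest path need not itself be quickest.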
62Quickest Path Algorithm: Chen and Chin
Let b_1 < b_2 < ... < b_m denote the distinct bandwidths. Let G_i be the subnetwork in which edges with bandwidth smaller than b_i are removed, and let P_i be the path with least delay in G_i. The quickest path is given by the P_i that minimizes the delay of P_i plus σ / b_i. Typically implemented using m invocations of Dijkstra's algorithm; m could be quite large.
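The procedure can be sketched directly: for each distinct bandwidth, run Dijkstra on the subnetwork that keeps only edges at or above that bandwidth, and keep the smallest value of least delay plus message size divided by the bandwidth threshold. This is a minimal sketch assuming undirected edges given as (u, v, bandwidth, delay); the function names and the three-edge example network are hypothetical, with edge values borrowed from the figure labels on the previous slide.

```python
import heapq

def least_delay(edges, src, dst, min_bw):
    """Dijkstra on the subnetwork keeping only edges with bandwidth >= min_bw.
    edges: list of (u, v, bandwidth, delay), treated as undirected.
    Returns the least total delay, or None if dst is unreachable."""
    adj = {}
    for u, v, bw, delay in edges:
        if bw >= min_bw:
            adj.setdefault(u, []).append((v, delay))
            adj.setdefault(v, []).append((u, delay))
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            return dist
        if dist > best[node]:
            continue  # stale heap entry
        for nxt, delay in adj.get(node, []):
            cand = dist + delay
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    return None

def quickest_path_delay(edges, src, dst, sigma):
    """Return (total delay, bandwidth threshold) of a quickest path, or None."""
    best = None
    for b in sorted({bw for _, _, bw, _ in edges}):
        d = least_delay(edges, src, dst, b)
        if d is None:
            break  # higher thresholds only remove more edges
        total = d + sigma / b
        if best is None or total < best[0]:
            best = (total, b)
    return best

# Hypothetical network: a direct low-bandwidth edge and a two-hop
# high-bandwidth route between s and d.
example_edges = [("s", "d", 5, 20), ("s", "a", 15, 5), ("a", "d", 15, 20)]
```

On this example a large message (size 60) is quickest over the two-hop high-bandwidth route, while a small message (size 6) is quickest over the direct edge, so the answer depends on the message size and not only on the network.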
63Simple Probabilistic Quickest Path Algorithm
Randomly choose a fraction of the distinct bandwidths and compute least-delay paths only on the corresponding subnetworks
For larger networks we needed fewer than 10 shortest delay computations. Question: Is there a fundamental reason for this?
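The sampling idea can be sketched by running the least-delay computation only for a random subset of the distinct bandwidths. The result is always the total delay of some real path, so it can only overestimate the optimum, and it is exact when every bandwidth is sampled. This is a hedged sketch, not the algorithm analyzed in the talk: the sampling rule, the function names, the seed handling and the five-edge example network are assumptions.

```python
import heapq
import random

def least_delay(edges, src, dst, min_bw):
    """Dijkstra on edges with bandwidth >= min_bw; None if unreachable.
    edges: list of (u, v, bandwidth, delay), treated as undirected."""
    adj = {}
    for u, v, bw, delay in edges:
        if bw >= min_bw:
            adj.setdefault(u, []).append((v, delay))
            adj.setdefault(v, []).append((u, delay))
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            return dist
        if dist > best[node]:
            continue  # stale heap entry
        for nxt, delay in adj.get(node, []):
            cand = dist + delay
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    return None

def sampled_quickest_delay(edges, src, dst, sigma, fraction, seed=0):
    """Approximate quickest-path delay using a random fraction of the
    distinct bandwidths; returns None if no sampled subnetwork connects."""
    rng = random.Random(seed)
    bandwidths = sorted({bw for _, _, bw, _ in edges})
    count = max(1, round(fraction * len(bandwidths)))
    best = None
    for b in rng.sample(bandwidths, count):
        d = least_delay(edges, src, dst, b)
        if d is not None:
            total = d + sigma / b
            best = total if best is None else min(best, total)
    return best

# Hypothetical network with four distinct bandwidths (5, 10, 15, 20).
example_edges = [
    ("s", "d", 5, 20), ("s", "a", 15, 5), ("a", "d", 15, 20),
    ("s", "b", 10, 8), ("b", "d", 20, 30),
]
exact = sampled_quickest_delay(example_edges, "s", "d", 60, fraction=1.0)
approx = sampled_quickest_delay(example_edges, "s", "d", 60, fraction=0.5)
```

On this example the exact value is 29, and sampling half of the bandwidths returns 29, 31 or 32 depending on which thresholds are drawn.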
64Analysis
Critical Observation: As a function of the bandwidth threshold, the least-delay function is non-decreasing. Its Vapnik-Chervonenkis dimension is 1. This makes it efficient to approximate by random sampling.
Figure: optimal delay; approximation based on p shortest path computations; linear approximation with p points
Rao 2004, Theoretical Computer Science
65Conclusions
- TCP-AIMD Dynamics
- Analytically established chaotic dynamics
- Analyzed Internet traces: combination of chaotic and stochastic dynamics
- New Classes of Protocols
- ONTCOU: achieve stable target flow level
- RUNAT: statistical approach to congestion control
- Based on Stochastic Approximation: convergence proof under general conditions
- Experimental results are promising both on Internet and dedicated connections
66Thank You