Title: Building Network-Centric Systems Liviu Iftode
1Building Network-Centric Systems, Liviu Iftode
2Before WWW, people were happy...
E-mail, Telnet
TCP/IP
Emacs
NFS
CS.umd.EDU
CS.rutgers.EDU
TCP/IP
- Mostly local computing
- Occasional TCP/IP networking with low expectations and mostly non-interactive traffic
- local area networks: file server (NFS)
- wide area networks (Internet): E-mail, Telnet, FTP
- Networking was not a major concern for the OS
3One Exception: Cluster Computing
Multicomputers
Clusters of computers
- Cost-effective solution for high-performance distributed computing
- TCP/IP networking was the headache
- large software overheads
- Software DSM: not a network-centric system :-(
4The Great WWW Challenge
Web Browsing
http://www.Bank.com
TCP/IP
Bank.com
- World Wide Web made access over the Internet easy
- Internet became commercial
- Dramatic increase of interactive traffic
- WWW networking creates a network-centric system: the Internet server
- performance: service more network clients
- availability: be accessible all the time over the network
- security: protect resources against network attacks
5Network-Centric Systems
- Networking dominates the operating system
- Mobile Systems
- mobility-aware TCP/IP (Mobile IP, I-TCP, etc.), disconnected file systems (Coda), adaptation-aware applications for mobility (Odyssey), etc.
- Internet Servers
- resource allocation (Lazy Receive Processing, Resource Containers), OS shortcuts (Scout, IO-Lite), etc.
- Pervasive/Ubiquitous Systems
- TinyOS, sensor networks (Directed Diffusion, etc.), programmability (One.world, etc.)
- Storage Networking
- network-attached storage (NASD, etc.), peer-to-peer systems (OceanStore, etc.), secure file systems (SFS, Farsite), etc.
6Big Picture
- Research sparked by various OS-Networking tensions
- Shift of focus from Performance to Availability and Manageability
- Networking and Storage I/O Convergence
- Server-based and serverless systems
- TCP/IP and non-TCP/IP protocols
- Local area, wide-area, ad-hoc and application/overlay networks
- Significant interest from industry
7Outline
- TCP Servers
- Migratory-TCP and Service Continuations
- Cooperative Computing, Smart Messages and Spatial Programming
- Federated File Systems
- Talk Highlights and Conclusions
8Problem 1: TCP/IP is too Expensive
Breakdown of the CPU time for Apache
(uniprocessor based Web-server)
9Traditional Send/Receive Communication
App
OS
App
OS
NIC
NIC
send(a)
copy(a,send_buf)
DMA(send_buf,NIC)
send_buf is transferred
interrupt
DMA(NIC,recv_buf)
copy(recv_buf,b)
receive(b)
sender
receiver
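The send/receive path in the diagram above can be sketched as a small simulation. This is only an illustration, assuming a hypothetical `Host` class that counts CPU copies and interrupts; it is not real socket or driver code.

```python
class Host:
    """Hypothetical host: one kernel buffer plus counters for CPU work."""
    def __init__(self):
        self.kernel_buf = b""
        self.cpu_copies = 0
        self.interrupts = 0


def send(sender, receiver, a):
    # copy(a, send_buf): the application buffer is copied into the kernel.
    sender.kernel_buf = bytes(a)
    sender.cpu_copies += 1
    # DMA(send_buf, NIC), wire transfer, DMA(NIC, recv_buf): no CPU copy.
    receiver.kernel_buf = sender.kernel_buf
    # Packet arrival interrupts the receiving host.
    receiver.interrupts += 1


def receive(receiver, b):
    # copy(recv_buf, b): the kernel buffer is copied into the application buffer.
    n = len(receiver.kernel_buf)
    b[:n] = receiver.kernel_buf
    receiver.cpu_copies += 1
    return n


s, r = Host(), Host()
buf = bytearray(5)
send(s, r, b"hello")
received = receive(r, buf)
```

Each message costs one CPU copy on each side plus an interrupt on the receiver, which is the overhead the rest of the talk tries to remove.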
10A Closer Look
11Multiprocessor Server Performance Does not Scale
[Figure: uniprocessor vs. dual-processor curves; x-axis: offered load (connections/s)]
Apache Web server 1.3.20 on 1-way and 2-way 300 MHz Pentium II SMP, with clients repeatedly accessing a static 16 KB file
12TCP/IP-Application Co-Habitation
- TCP/IP steals compute cycles and memory from applications
- TCP/IP executes in kernel mode: mode-switching overhead
- TCP/IP executes asynchronously
- interrupt processing overhead
- internal synchronization on multiprocessor servers causes execution serialization
- Cache pollution
- Hidden Service-work
- TCP packet retransmission
- TCP ACK processing
- ARP request service
- Extreme cases can compromise server performance
- Receive livelocks
- Denial-of-service (DoS) attacks
13Two Solutions
- Replace TCP/IP with a lightweight transport protocol
- Offload some/all of the TCP from host to a dedicated computing unit (processor, computer or intelligent network interface)
- Industry: high-performance, expensive solutions
- Memory-to-Memory Communication: InfiniBand
- Intelligent network interface: TCP Offloading Engine (TOE)
- Cost-effective and flexible solutions: TCP Servers
14Memory-to-Memory(M-M) Communication
Sender
Receiver
Send
Receive
Application
TCP/IP
OS
Network Interface (NIC)
Memory Buffer
Remote DMA
M-M
OS
OS
NIC
NIC
15Memory-to-Memory Communication is Non-Intrusive
App
App
NIC
NIC
RDMA_Write(a,b)
a transferred into b
b is updated
Sender: low overhead
Receiver: zero overhead
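A minimal sketch of why an RDMA write is non-intrusive, assuming hypothetical `Node` and `RegisteredMemory` classes that stand in for hosts and NIC-registered memory regions. It models only who does the work, not a real RDMA API.

```python
class RegisteredMemory:
    """Hypothetical stand-in for a memory region registered with the NIC."""
    def __init__(self, size):
        self.data = bytearray(size)


class Node:
    def __init__(self, size):
        self.region = RegisteredMemory(size)
        self.cpu_events = 0  # copies or interrupts handled by this node's CPU


def rdma_write(sender, receiver, a, offset=0):
    # The sender posts one descriptor to its NIC (low overhead).
    sender.cpu_events += 1
    # The NICs place the bytes directly in the receiver's memory.
    # No receiver code runs, so the receiver's counter is untouched.
    receiver.region.data[offset:offset + len(a)] = a


src, dst = Node(16), Node(16)
rdma_write(src, dst, b"abc", offset=4)
```

The receiver only observes that its buffer `b` has been updated, matching "b is updated" in the diagram.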
16TCP Server at a Glance
- A software offloading architecture using existing hardware
- Basic idea: dedicate one or more computing units exclusively for TCP/IP
- Compared to TOE
- tracks technology better: latest processors
- flexible: adapts to changing load conditions
- cost-effective: no extra hardware
- Isolate application computation from network processing
- Eliminate network interrupts and context switches
- Efficient resource allocation
- Additional performance gains (zero-copy) with extended socket API
- Related work
- Very preliminary offloading solutions: Piglet, CSP
- Socket Direct Protocol, Zero-copy TCP
17Two TCP Server Architectures
- TCP Servers for Multiprocessor Servers
TCP-Server
Server Appl
TCP/IP
CPU
CPU
Shared Memory
- TCP Servers for Cluster-based Servers
TCP/IP
M-M
TCP-Server
Server Appl
18Where to Split TCP/IP Processing? (How much to offload?)
APPLICATION
Application Processors
SYSTEM CALLS
SEND: copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out
RECEIVE: copy_to_application_buffers, TCP_receive, IP_receive, software_interrupt_handler, interrupt_handler, packet_in
TCP Servers
19Evaluation Testbed
- Multiprocessor Server
- 4-way 550 MHz Intel Pentium II system running Apache 1.3.20 web server on Linux 2.4.9
- NIC: 3-Com 996-BT Gigabit Ethernet
- Used sclients as a client program [Banga 97]
20Comparative Throughput
Clients issue file requests according to a web
server trace
21Adaptive TCP Servers
- Static TCP Server configuration
- Too few TCP Servers can lead to network processing becoming the bottleneck
- Too many TCP Servers lead to degradation in performance of CPU-intensive applications
- Dynamic TCP Server configuration
- Monitor the TCP Server queue lengths and system load
- Dynamically add or remove TCP Server processors
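The dynamic configuration policy could look roughly like the following. This is a sketch only: the function name, the thresholds, and the rule of always leaving at least one processor on each side are assumptions for illustration, not the policy used in the actual system.

```python
def adjust_tcp_servers(n_tcp, n_cpus, queue_len, app_load,
                       high_queue=64, low_queue=8, high_load=0.9):
    """Return the new number of processors dedicated to TCP/IP.

    All thresholds are illustrative assumptions.
    """
    # Network processing is the bottleneck: add a TCP Server processor,
    # as long as at least one processor remains for applications.
    if queue_len > high_queue and n_tcp < n_cpus - 1:
        return n_tcp + 1
    # Queues are short while applications are CPU-starved: give one back,
    # but always keep at least one TCP Server processor.
    if queue_len < low_queue and app_load > high_load and n_tcp > 1:
        return n_tcp - 1
    return n_tcp


grown = adjust_tcp_servers(1, 4, queue_len=100, app_load=0.5)
shrunk = adjust_tcp_servers(2, 4, queue_len=2, app_load=0.95)
```

A monitor would call this periodically with the observed TCP Server queue length and system load.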
22Next Target: Storage Networking
- Storage Networking dilemma: TCP or not TCP?
M-M Communication (InfiniBand)
TCP Offloading
iSCSI (SCSI over IP)
DAFS (Direct Access File Systems)
- non-TCP/IP solutions require new wiring or tunneling over IP-based Ethernet networks
- TCP/IP solutions require TCP offloading
23Future Work: TCP Servers and iSCSI
TCP-Server iSCSI
Server Appl
SCSI Storage
iSCSI
CPU
CPU
TCP/IP
Shared Memory
- Use TCP-Servers to connect to SCSI storage using
iSCSI protocol over TCP/IP networks
24Problem 2: TCP/IP is too Rigid
- Server vs. Service Availability
- client interested in service availability
- Adverse conditions may affect service availability
- internetwork congestion or failure
- servers overloaded, failed or under DoS attack
- TCP has one response
- network delays => packet loss => retransmission
- TCP limits the OS solutions for service availability
- early binding of service to a server
- client cannot switch to another server for sustained service after the connection is established
25 Service Availability through Migration
Server 1
Client
Server 2
26Migratory TCP at a Glance
- Migratory TCP migrates live connections among cooperative servers
- Migration mechanism is generic (not application specific), lightweight (fine-grained migration) and low-latency
- Migration triggered by client or server
- Servers can be geographically distributed (different IP addresses)
- Requires changes to the server application
- Totally transparent to the client application
- Interoperates with existing TCP
- Migration policies decoupled from migration mechanism
27Basic Idea: Fine-Grained State Migration
Server1 Process
Application state
Connection state
C2
Client
C1 C2 C3 C4
C5 C6
Server2 Process
28Migratory-TCP (Lazy) Protocol
Participants: Client, Server 1, Server 2 (connection C)
(0) Connect
(1) Migration Request
(2) <State Request>
(3) <State Reply>
(4) Migration Accept
29Non-Intrusive Migration
- Migrate state without involving old-server application (only old server OS)
- Old server exports per-connection state periodically
- Connection state and application state can go out of sync
- Upon migration, new server imports the last exported state of the migrated connection
- OS uses connection state to synchronize with application
- Non-intrusive migration with M-M communication
- uses RDMA read to extract state from the old server with zero overhead
- works even when the old server is overloaded or frozen
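The export/import cycle can be illustrated with a toy model. Everything here is an assumption made for the sketch: the class names, the snapshot layout, and the re-synchronization rule (the new server resumes the application at the snapshot offset and suppresses bytes the client already has). `rdma_read_state` merely stands in for an RDMA read.

```python
class OldServer:
    def __init__(self):
        self.exported = {}    # conn_id -> last exported snapshot
        self.bytes_sent = {}  # conn_id -> live application progress

    def serve(self, conn_id, nbytes):
        self.bytes_sent[conn_id] = self.bytes_sent.get(conn_id, 0) + nbytes

    def export_state(self, conn_id):
        # Periodic export of the per-connection application state.
        self.exported[conn_id] = {"app_offset": self.bytes_sent[conn_id]}


def rdma_read_state(old_server, conn_id):
    # Stand-in for an RDMA read: only memory is read, no old-server code runs,
    # so this would work even if the old server were overloaded or frozen.
    return dict(old_server.exported[conn_id])


class NewServer:
    def import_connection(self, old_server, conn_id, client_received):
        snap = rdma_read_state(old_server, conn_id)
        # The application resumes from the snapshot, which may lag the connection.
        app_offset = snap["app_offset"]
        # The OS skips bytes the client already received, re-syncing the two.
        skip = max(0, client_received - app_offset)
        return app_offset, skip


old = OldServer()
old.serve("C2", 1000)
old.export_state("C2")
old.serve("C2", 500)  # progress made after the last export
resume = NewServer().import_connection(old, "C2", client_received=1400)
```

Here the snapshot is 400 bytes behind what the client has received, so the new server replays from offset 1000 and suppresses the first 400 bytes.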
30Service Continuation (SC)
Connection state
31Related Work
- Process migration: Sprite [Douglis 91], Locus [Walker 83], MOSIX [Barak 98], etc.
- VM migration: [Rosenblum 02], [Nieh 02]
- Migration in web server clusters: [Snoeren 00], [Luo 01]
- Fault-tolerant TCP: [Alvisi 00]
- TCP extensions for host mobility: I-TCP [Bakre 95], Snoop TCP [Balakrishnan 95], end-to-end approaches [Snoeren 00], Msocks [Maltz 98]
- SCTP (RFC 2960)
32Evaluation
- Implemented SC and M-TCP in FreeBSD kernel
- Integrated SC in real Internet servers
- web, media streaming, transactional DB
- Microbenchmark
- impact of migration on client-perceived throughput for a two-process server using TTCP
- Real applications
- sustain web server throughput under load produced by increasing the number of client connections
33Impact of Migration on Throughput
34Web Server Throughput
35Future Research: Use SC to Build Self-Healing Cluster-based Systems
36Problem 3: Computer Systems move Outdoors
Linux Car
Sensors
Linux Camera
Linux Watch
- Massive numbers of computers will be embedded everywhere in the physical world
- Dynamic ad-hoc networking
- How to execute user-defined applications over these networks?
37Outdoor Distributed Computing
- Traditional distributed computing has been indoor
- Target: performance and/or fault tolerance
- Stable configuration, robust networking (TCP/IP or M-M)
- Relatively small scale
- Functionally equivalent nodes
- Message passing or shared memory programming
- Outdoor Distributed Computing
- Target: collect/disseminate distributed data and/or perform collective tasks
- Volatile nodes and links
- Node equivalence determined by their physical properties (content-based naming)
- Data migration is not good
- expensive to perform end-to-end transfer control
- too rigid for such a dynamic network
38Cooperative Computing at a Glance
- Distributed computing with execution migration
- Smart Message: carries the execution state (and possibly the code) in addition to the payload
- execution state assumed to be small (explicit migration)
- code usually cached (few applications)
- Nodes cooperate by allowing Smart Messages
- to execute on them
- to use their memory to store persistent data (tags)
- Nodes do not provide routing
- Smart Message executes on each node of its path
- Application executed on target nodes (nodes of interest)
- Routing executed on each node of the path (self-routing)
- During its lifetime, an application generates at least one, possibly multiple, smart messages
39Smart vs. Dumb Messages
Mary's lunch: Appetizer, Entree, Dessert
Data migration
40Smart Messages
Hot
Hot
Hot
41Cooperative Node Architecture
Virtual Machine
SM Arrival
SM Migration
Admission Manager
Scheduling
Tag Space
OS I/O
- Admission control for resource security
- Non-preemptive scheduling with timeout-kill
- Tags created by SMs (limited lifetime) or I/O tags (permanent)
- global tag name space: hash(SM code), tag name
- five protection domains defined using hash(SM code), SM source node ID, and SM starting time
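A small sketch of a tag space keyed by (hash(SM code), tag name), with limited-lifetime and permanent tags. The `TagSpace` class, the use of SHA-256, and the explicit `now` parameter are assumptions for illustration; protection domains are not modeled.

```python
import hashlib
import time


class TagSpace:
    """Hypothetical per-node tag store keyed by (hash(SM code), tag name)."""
    def __init__(self):
        self.tags = {}  # key -> (value, expiry time or None)

    @staticmethod
    def key(sm_code, name):
        # SHA-256 is an assumed choice of hash function.
        return (hashlib.sha256(sm_code).hexdigest(), name)

    def create(self, sm_code, name, value, lifetime=None, now=None):
        now = time.time() if now is None else now
        # lifetime=None models a permanent (I/O) tag.
        expiry = None if lifetime is None else now + lifetime
        self.tags[self.key(sm_code, name)] = (value, expiry)

    def read(self, sm_code, name, now=None):
        now = time.time() if now is None else now
        k = self.key(sm_code, name)
        entry = self.tags.get(k)
        if entry is None:
            return None
        value, expiry = entry
        if expiry is not None and now >= expiry:
            del self.tags[k]  # the tag has outlived its lifetime
            return None
        return value


ts = TagSpace()
ts.create(b"sm-code-A", "route", "node7", lifetime=10, now=0)
ts.create(b"io", "temperature", 31)  # permanent tag
```

Because the key includes the hash of the SM code, two different applications can use the same tag name on one node without colliding.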
42Related Work
- Mobile agents (D'Agents, Ajanta)
- Active networks (ANTS, SNAP)
- Sensor networks (Diffusion, TinyOS, TAG)
- Pervasive computing (One.world)
43Prototype Implementation
- 8 HP iPAQs running Linux
- 802.11 wireless communication
- Sun Java K Virtual Machine
- Geographic (simplified GPSR) and On-Demand (AODV)
routing
user node
intermediate node
node of interest
Completion Time
Routing algorithm     Code not cached (ms)   Code cached (ms)
Geographic (GPSR)     415.6                  126.6
On-demand (AODV)      506.6                  314.7
44Self-Routing
- There is no best routing outdoors
- Depends on application and node property dynamics
- Application-controlled routing
- Possible with Smart Messages (execution state carried in the message)
- When migration times out, the application is upcalled on the current node to decide what to do next
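Application-controlled routing with a timeout upcall might be sketched as follows. The message layout (a dict carrying its own `routing` and `on_timeout` functions), the one-dimensional node positions, and the toy routing functions are all assumptions for illustration; they are not GPSR or AODV.

```python
def geographic(msg, neighbors):
    # Greedy choice: the neighbor whose position is closest to the target.
    return min(neighbors, key=lambda n: abs(neighbors[n] - msg["target"]),
               default=None)


def on_demand(msg, neighbors):
    # Stand-in for a discovered route: first neighbor on a known path.
    return next((n for n in msg["known_path"] if n in neighbors), None)


def switch_to_on_demand(msg, neighbors):
    # Application upcall: change the routing algorithm and retry.
    msg["routing"] = on_demand
    return "retry"


def route_step(msg, neighbors, try_migrate):
    """Run one self-routing step on the current node."""
    next_hop = msg["routing"](msg, neighbors)
    if next_hop is not None and try_migrate(next_hop):
        msg["path"].append(next_hop)
        return "migrated"
    # Migration timed out: let the application decide what to do next.
    return msg["on_timeout"](msg, neighbors)


msg = {"routing": geographic, "on_timeout": switch_to_on_demand,
       "target": 10, "known_path": ["b"], "path": []}
neighbors = {"a": 9, "b": 2}            # node -> position
reachable = lambda node: node != "a"    # migration to "a" times out
first = route_step(msg, neighbors, reachable)
second = route_step(msg, neighbors, reachable)
```

The first step picks `a` geographically and times out; the upcall switches the message to on-demand routing, and the second step migrates to `b`.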
45Self-Routing Effectiveness (simulation)
- geographical routing to reach target regions
- on-demand routing within region
- application decides when to switch between the
two
starting node
node of interest
other node
46Next Target: Spatial Programming
- Smart Messages: too low-level programming
- How to describe distributed computing over dynamic outdoor networks of embedded systems with limited knowledge about resource number, location, etc.?
- Spatial Programming (SP) design guidelines
- space is a first-order programming concept
- resources named by their expected location and properties (spatial reference)
- reference consistency: spatial reference-to-resource mappings are consistent throughout the program
- program must tolerate resource dynamics
- SP can be implemented using Smart Messages (the spatial reference mapping table carried as payload)
47Spatial Programming Example
Mobile sprinklers with temperature sensors
Right Hill
Left Hill
Hot spot
- Program sprinklers to water the hottest spot of
the Left Hill
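In ordinary Python, the sprinkler task could be approximated as below. This is not Spatial Programming syntax; the record layout (`region`, `temp`, `watering`) and the function name are assumptions, and the whole network is modeled as a local list.

```python
def water_hottest(sprinklers, region):
    """Turn on the sprinkler that reports the highest temperature in `region`."""
    # Resources are selected by expected location, not by node address.
    candidates = [s for s in sprinklers if s["region"] == region]
    if not candidates:
        return None  # tolerate resource dynamics: nothing is there right now
    hottest = max(candidates, key=lambda s: s["temp"])
    hottest["watering"] = True
    return hottest


sprinklers = [
    {"id": 1, "region": "Left Hill", "temp": 29.0, "watering": False},
    {"id": 2, "region": "Left Hill", "temp": 35.5, "watering": False},
    {"id": 3, "region": "Right Hill", "temp": 40.0, "watering": False},
]
chosen = water_hottest(sprinklers, "Left Hill")
```

The hotter sprinkler on the Right Hill is ignored because the spatial reference restricts the search to the Left Hill.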
48 Problem 4: Manageable Distributed File Systems
- Most distributed file servers use TCP/IP both for client-server and intra-server communication
- Strong file consistency, file locking and load balancing difficult to provide
- File servers require significant human effort to manage: add storage, move directories, etc.
- Cluster-based file servers are cost-effective
- Scalable performance requires load balancing
- Load balancing may require file migration
- File migration limited if file naming is location-dependent
- We need a scalable, location-independent and easy to manage cluster-based distributed file system
49Federated File System at a Glance
A2
A2
A3
A3
A3
A1
FedFS
FedFS
Local FS
Local FS
Local FS
Local FS
M-M Interconnect
- Global file name space over a cluster of autonomous local file systems interconnected by an M-M network
50Location Independent Global File Naming
- Virtual Directory (VD): union of local directories
- volatile, created on demand (dirmerge)
- contains information about files including location (homes of files)
- assigned dynamically to nodes (managers)
- supports location-independent file naming and file migration
- Directory Tables (DT): local caches of VD entries (analogous to a TLB)
usr
virtual directory
file1
file2
local directories
usr
usr
file1
file2
Local file system 1
Local file system 2
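The union of local directories in the figure can be sketched as follows. The function names reuse the slide's term `dirmerge`, but the data layout (a dict from node to file names) and the `migrate` helper are assumptions for illustration.

```python
def dirmerge(local_dirs):
    """Build a virtual directory: file name -> home node.

    local_dirs maps each node to the file names in its local copy
    of the directory.
    """
    vd = {}
    for node, names in local_dirs.items():
        for name in names:
            vd.setdefault(name, node)
    return vd


def migrate(vd, local_dirs, name, dst):
    # Move the file between local file systems and update its home.
    src = vd[name]
    local_dirs[src].remove(name)
    local_dirs[dst].append(name)
    vd[name] = dst  # the global name (e.g. /usr/file1) does not change


usr = {"fs1": ["file1"], "fs2": ["file2"]}  # local /usr on each node
vd = dirmerge(usr)
migrate(vd, usr, "file1", "fs2")
```

Clients keep using the same global name before and after the migration; only the home recorded in the virtual directory changes.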
51Direct Access File System (DAFS)
52Federated DAFS
Distributed NFS over FedFS
NFS Server
FedFS
Local FS
M-M
TCP/IP
TCP/IP
Application
NFS Client
TCP/IP
TCP/IP
M-M
53Related Work
- Cluster-based File Systems
- Frangipani [Thekkath 97], PVFS [Carns 00], GFS, Archipelago [Ji 00], Trapeze (Duke)
- DAFS [NetApp 03], [Magoutis 01, 02, 03]
- User-level communication in cluster-based network servers [Carrera 02]
54Experimental Platform
- Eight node server cluster
- 800 MHz PIII, 512 MB SDRAM, 9 GB 10K RPM SCSI
- Client
- Dual processor (300 MHz PII), 512 MB SDRAM
- Linux-2.4
- Servers and Clients equipped with Emulex cLAN
adapter (M-M network)
55Workload I
- Postmark: synthetic benchmark
- Short-lived small files
- Mix of metadata-intensive operations
- Postmark outline
- Create a pool of files
- Perform transactions: READ/WRITE paired with CREATE/DELETE
- Delete created files
- Each Postmark client performs 30,000 transactions
- Clients distribute requests to servers using a hash function on pathnames
- Files are physically placed on the node which receives client requests
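Request distribution by pathname hash could be sketched like this. CRC32 is an assumed hash function chosen only because it is deterministic across runs; the function names and the `placement` dict are hypothetical.

```python
import zlib


def server_for(pathname, n_servers):
    # Deterministic hash of the pathname picks the server for the request.
    return zlib.crc32(pathname.encode("utf-8")) % n_servers


placement = {}  # pathname -> node that physically holds the file


def create(pathname, n_servers=8):
    node = server_for(pathname, n_servers)
    # The file is placed on the node that received the create request.
    placement[pathname] = node
    return node


home = create("/pool/file0001")
later = server_for("/pool/file0001", 8)
```

Since creation and later accesses hash the same pathname, requests in this workload always arrive at the node that holds the file.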
56Postmark Throughput
[Figure: Postmark throughput (txns/sec)]
57Workload II
- Postmark performs only READ transactions
- No create/delete operations
- Federated DAFS does not control file placement
- No client request is sent to the file's correct location
58Postmark Read Throughput
[Figure: Postmark read throughput (txns/sec)]
59Next Target: Federated DAFS over the Internet
DAFS Server
FedFS
Local FS
M-M
TCP/IP
DAFS Server
Application
Internet
DAFS Client
FedFS
Local FS
M-M
M-M
Application
DAFS Client
M-M
DAFS Server
FedFS
Local FS
M-M
60Outline
- TCP Servers
- Migratory-TCP and Service Continuations
- Cooperative Computing, Smart Messages and Spatial Programming
- Federated File Systems
- Talk Highlights and Conclusions
61Talk Highlights
- Back to Migration
- Service Continuation: service availability and self-healing clusters
- Smart Messages: programming dynamic networks of embedded systems
- Exploit Non-Intrusive M-M Communication
- TCP offloading
- State migration
- Federated file systems
- Network and Storage I/O Convergence
- TCP Servers and iSCSI
- Federated File Systems and M-M
- Programmability
- Smart Messages and Spatial Programming
- Extended Server API: Service Continuation, TCP Servers, Federated file system
62Conclusions
- Network-Centric Systems: a very promising border-crossing systems research area
- Common issues for a large spectrum of systems and networks
- Tremendous potential to impact industry
63Acknowledgements
- UMD students: Andrzej Kochut, Chunyuan Liao, Tamer Nadeem, Iulian Neamtiu and Jihwang Yeo.
- Rutgers students: Ashok Arumugam, Kalpana Banerjee, Aniruddha Bohra, Cristian Borcea, Suresh Gopalakrisnan, Deepa Iyer, Porlin Kang, Vivek Pathak, Murali Rangarajan, Rabita Sarker, Akhilesh Saxena, Steve Smaldone, Kiran Srinivasan, Florin Sultan and Gang Xu.
- Post-doc: Chalermek Intanagonwiwat
- Collaborations at Rutgers: EEL (Ulrich Kremer), DARK (Ricardo Bianchini), PANIC (Rich Martin and Thu Nguyen)
- Support: NSF ITR ANI-0121416 and CAREER CCR-013366