Title: Eclipse: an Operating System with Quality of Service support
1QoS Support in Operating Systems
Banu Özden, Bell Laboratories, ozden_at_research.bell-labs.com
2Vision
- Service providers will offer storage and computing services
- through their distributed data centers
- connected with high bandwidth networks
- to globally distributed clients.
- Clients will access these services via diverse devices and networks, e.g.
- mobile devices and wireless networks,
- high-end computer systems and high bandwidth networks.
- These services will become utilities (e.g., storage utility, computing utility).
- Eventually resources will be exchanged and traded between geographically dispersed data centers to address fluctuating demand.
3Eclipse/BSD: an Operating System with Quality of Service Support
Banu Özden ozden_at_research.bell-labs.com
4Motivation
- QoS support for (server) applications
- web servers
- video servers
- Isolation and differentiation of different
- entities serviced on the same platform
- applications running on the same platform
- QoS requirements
- client-based
- service-based
- content-based
5Design Goals
- QoS support in a general purpose operating system
- Remain compatible with the underlying operating system
- QoS parameters
- Isolation
- Differentiation
- Fairness
- (Cumulative) throughput
- Flexible resource management
- capable of implementing a large set of provisioning needs
- supports a large set of server applications without imposing significant changes to their design
6Talk Outline
- Schedulers
- Reservation File System (reservfs)
- Tagging
- Web Server Experiments
- Access Control and Profiles
- Eclipse/BSD Status
- Related Work
- Future Work
7Proportional sharing
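- Generalized processor sharing (GPS)
- $\phi_i$: weight of flow $i$
- $W_i(t_1, t_2)$: service received by flow $i$ in $[t_1, t_2]$
- $F$: set of flows
- For any flow $i$ continuously backlogged in $[t_1, t_2]$: $\frac{W_i(t_1, t_2)}{W_j(t_1, t_2)} \ge \frac{\phi_i}{\phi_j}$ for all $j \in F$
- Thus, rate of flow $i$ in $[t_1, t_2]$ is at least $\frac{\phi_i}{\sum_{j \in F} \phi_j} \, r$, where $r$ is the rate of the shared resource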
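The proportional-sharing rule can be illustrated with a few lines of Python. This is only a sketch of the fluid GPS model, not code from Eclipse/BSD; the function name `gps_rates` and the example weights are assumptions chosen for illustration.

```python
def gps_rates(weights, backlogged, link_rate):
    """Return the instantaneous GPS service rate of every flow.

    weights: {flow: weight}; backlogged: set of flows with queued work;
    link_rate: capacity shared among the backlogged flows.
    """
    total = sum(weights[f] for f in backlogged)
    rates = {}
    for flow, w in weights.items():
        # Idle flows receive nothing; their share is split among the
        # backlogged flows in proportion to the weights.
        rates[flow] = link_rate * w / total if flow in backlogged else 0.0
    return rates


weights = {"a": 1, "b": 2, "c": 1}            # hypothetical weights
all_busy = gps_rates(weights, {"a", "b", "c"}, 100.0)
c_idle = gps_rates(weights, {"a", "b"}, 100.0)
print(all_busy)   # a: 25.0, b: 50.0, c: 25.0
print(c_idle)     # a and b grow beyond their guaranteed rates, c gets 0.0
```

When flow c goes idle, a and b still split the link 1:2, so each backlogged flow never drops below its guaranteed share.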
8QoS Guarantees
- Fairness
- Throughput
- Packet delay
9Schedulers in Eclipse
- Resource characteristics differ
- Different hierarchical proportional-share schedulers for resources
- Link scheduler: WF2Q
- Disk scheduler: YFQ
- CPU scheduler: MTR-LS
- Network input: SRP
10Hierarchical GPS Example
[Figure: proportional sharing compared with hierarchical proportional sharing]
11Schedulers
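- Hierarchical proportional-sharing (GPS)
- $Q_n$: descendant queue nodes of node $n$
- $W_n(t_1, t_2)$: service received by scheduler node $n$ in $[t_1, t_2]$
- $S_n$: set of immediate descendant nodes of the parent of node $n$
- For any node $n$ continuously backlogged in $[t_1, t_2]$: $\frac{W_n(t_1, t_2)}{W_m(t_1, t_2)} \ge \frac{\phi_n}{\phi_m}$ for all $m \in S_n$, where $\phi_n$ is the weight of node $n$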
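A small Python sketch may help show how hierarchical sharing differs from flat sharing. The tree layout (nested dictionaries with `weight`, `children` and `backlogged` keys), the function names and the company/page weights are all assumptions made for this illustration; the code models the fluid ideal, not any of the packetized schedulers used in Eclipse/BSD.

```python
def is_backlogged(node):
    """A queue node is backlogged if flagged so; a scheduler node if any descendant is."""
    children = node.get("children")
    if children is None:
        return node.get("backlogged", True)
    return any(is_backlogged(c) for c in children.values())


def hierarchical_rates(node, rate, path=""):
    """Split `rate` among backlogged siblings by weight, recursively down the tree."""
    children = node.get("children")
    if children is None:
        return {path: rate if is_backlogged(node) else 0.0}
    active = {name for name, c in children.items() if is_backlogged(c)}
    total = sum(children[name]["weight"] for name in active)
    rates = {}
    for name, child in children.items():
        child_rate = rate * child["weight"] / total if name in active else 0.0
        rates.update(hierarchical_rates(child, child_rate, f"{path}/{name}"))
    return rates


# Hypothetical tree: two companies sharing one link, each with two pages.
link = {"weight": 1, "children": {
    "A": {"weight": 3, "children": {"a1": {"weight": 1}, "a2": {"weight": 1}}},
    "B": {"weight": 1, "children": {"b1": {"weight": 1}, "b2": {"weight": 3}}},
}}
busy = hierarchical_rates(link, 100.0)
link["children"]["A"]["children"]["a2"]["backlogged"] = False
a2_idle = hierarchical_rates(link, 100.0)
```

With every queue busy, company A receives 75 and company B 25. When page a2 goes idle its share is absorbed by a1 inside company A, while company B's pages keep exactly the same rates; under flat sharing the idle share would be spread over all remaining queues.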
12Link Aggregation
- Need to incrementally scale bandwidth
- Resource aggregation is emerging as a solution
- Grouping multiple resources into a single logical unit
- QoS over such aggregated links?
13Multi-Server Model
- Multi Server Fair Queuing (MSFQ)
- A packetized algorithm for a system with N links,
each with a bandwidth of r, that approximates a
GPS system with a single link with Nr bandwidth
[Figure: reference model and packetized scheduler]
14Multi-Server Model (Contd.)
- Goals
- Guarantee bandwidth and packet delay bounds that are independent of the number of flows
- Allow flows to arrive and depart dynamically
- Be work-conserving
- Algorithm
- When a server is idle, schedule the packet that
would complete transmission earliest under a
single server GPS system with a bandwidth of Nr
Sigcomm 2001
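The scheduling rule can be tried out with a simplified Python sketch. It assumes that every packet is already queued at time 0, which avoids the virtual-time bookkeeping a real implementation needs for dynamic arrivals; the function names and the two-flow example are hypothetical and this is not the algorithm as published.

```python
import heapq


def gps_finish_times(flows, weights, rate):
    """Fluid GPS finish time of each packet; all packets are queued at time 0.

    flows: {flow: [packet lengths in FIFO order]}.
    Returns {(flow, index): finish time}.
    """
    remaining = {f: list(p) for f, p in flows.items() if p}
    index = {f: 0 for f in remaining}
    finish, now = {}, 0.0
    while remaining:
        total_w = sum(weights[f] for f in remaining)
        share = {f: rate * weights[f] / total_w for f in remaining}
        # Advance to the moment the next head-of-line packet completes.
        dt = min(remaining[f][0] / share[f] for f in remaining)
        now += dt
        for f in list(remaining):
            remaining[f][0] -= share[f] * dt
            if remaining[f][0] <= 1e-9:
                finish[(f, index[f])] = now
                index[f] += 1
                remaining[f].pop(0)
                if not remaining[f]:
                    del remaining[f]
    return finish


def msfq_schedule(flows, weights, n_servers, r):
    """Give each idle server the packet with the earliest GPS finish time."""
    finish = gps_finish_times(flows, weights, n_servers * r)
    order = sorted(finish, key=lambda p: (finish[p], p))
    servers = [(0.0, s) for s in range(n_servers)]   # (time free, server id)
    heapq.heapify(servers)
    completion = {}
    for flow, idx in order:
        free_at, s = heapq.heappop(servers)
        done = free_at + flows[flow][idx] / r         # non-preemptive, rate r
        completion[(flow, idx)] = done
        heapq.heappush(servers, (done, s))
    return finish, completion


flows = {"a": [3.0], "b": [2.0]}      # hypothetical packet lengths
weights = {"a": 3, "b": 1}
gps_finish, msfq_done = msfq_schedule(flows, weights, n_servers=2, r=1.0)
```

In this example the reference GPS system (one link of bandwidth 2) finishes the packet of flow a at time 2.0 and that of flow b at 2.5, yet on two links of bandwidth 1 the packet of b completes at 2.0 and the packet of a at 3.0. Packets dispatched in GPS order can therefore complete in reverse order on the aggregated links.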
15MSFQ Preliminary Properties
Multi-Server specific properties
- Ordering: a pair of packets scheduled in the order of their GPS finishing times may complete in reverse order
- GPS busy implies MSFQ busy, but the converse is not true
- Non-coinciding busy periods
-
- Work backlog?
16MSFQ Properties
- Maximum service discrepancy (buffer requirement)
-
- Maximum packet delay
-
- Maximum per-flow service discrepancy
-
17Schedulers (contd.)
- Disk scheduling with QoS
- tradeoffs between QoS and total disk performance
- driver queue management
- queue depth
- queue ordering
- fragmentation
- Hierarchical YFQ
- CPU scheduling with QoS
- lengths of CPU phases are not known a priori
- cumulative throughput
- Hierarchical MTR-LS
18Eclipse's Key Elements
- Hierarchical, proportional share resource schedulers
- Reservation, reservation file system (reservfs)
- Tagging mechanism
- Access and admission control, reservation domain
19Reservations and Schedulers
- (Resource) reservations
- unit for QoS assignment
- similar to the concept of a flow in packet scheduling
- Hierarchical schedulers
- a tree with two kinds of nodes
- scheduler nodes
- queue nodes
- each node corresponds to a reservation
- Schedulers are dynamically reconfigurable
20Web Server Example
- Hosting two companies' web sites, each with two web pages
[Figure: network bandwidth shared between company A and company B]
21Reservfs
- We built the reservation file system
- to create and manipulate reservations
- to access and configure resource schedulers
22Reservfs
- Hierarchical
- Each reservation directory corresponds to a node at a scheduler
- Each resource is represented by a reservation directory under /reserv
23Reservfs
- Two types of reservation directories
- scheduler directories
- queue directories
- Scheduler directories are hierarchically expandable
- Queue directories are not expandable
24Reservfs
- Scheduler directory
- share
- newqueue
- newreserv
- special queue q0
- Queue directory
- share
- backlog
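The directory layout can be mimicked with a toy in-memory model in Python. The class and method names below are assumptions for illustration only; in Eclipse/BSD these objects are files and directories served by the kernel, and the way names such as q1 or r0 are assigned here (lowest free number) is a guess.

```python
class QueueDir:
    """Leaf reservation: contains only `share` and `backlog`, cannot be expanded."""

    def __init__(self, weight=1, minimum=0):
        self.share = (weight, minimum)
        self.backlog = 0

    def ls(self):
        return ["backlog", "share"]


class SchedulerDir:
    """Scheduler reservation: expandable through `newqueue` and `newreserv`."""

    def __init__(self, weight=1, minimum=0):
        self.share = (weight, minimum)
        self.children = {"q0": QueueDir()}   # the special queue q0

    def _next_name(self, prefix):
        n = 0
        while f"{prefix}{n}" in self.children:
            n += 1
        return f"{prefix}{n}"

    def new_queue(self, weight, minimum):
        name = self._next_name("q")
        self.children[name] = QueueDir(weight, minimum)
        return name

    def new_reserv(self, weight, minimum):
        name = self._next_name("r")
        self.children[name] = SchedulerDir(weight, minimum)
        return name

    def ls(self):
        return sorted(["newqueue", "newreserv", "share", *self.children])


fxp0 = SchedulerDir()
before = fxp0.ls()
queue_name = fxp0.new_queue(50, 1000000)
sched_name = fxp0.new_reserv(10, 0)
after = fxp0.ls()
```

A freshly created scheduler directory already holds its own q0, so the hierarchy can be extended to any depth, while a queue directory offers no way to create children.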
25Reservfs
[Figure: web server and video server on top of the application interface, the reservation file system, and the scheduler interface]
26Reservfs API
- Creation of a new queue/scheduler reservation
- fd = open(newqueue/newreserv, O_CREAT)
- fd of newly created share file
27Creating Queue Reservation
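[Figure: reservation tree under /reserv with resource directories cpu, fxp0, da0 and fxp1, queue directories q0, and scheduler directory r1; open(newqueue, O_CREAT) creates queue q1 and returns fd]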
28Creating Scheduler Reservation
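[Figure: reservation tree under /reserv with resource directories cpu, fxp0 and fxp1, queue directories q0 and q1, and scheduler directory r1]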
29Reservfs API
- Changing QoS parameters
- writing a weight and min value to the share file
- Getting QoS parameters
- reading the share file
- Getting/setting queue parameters
- reading/writing the backlog file
30Reservfs API
Command line output
killerbee cd /reserv
killerbee ls -al
total 5
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:37 .
drwxr-xr-x  20 root  wheel  512 Sep 12 21:54 ..
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:37 cpu
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:37 fxp0
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:37 fxp1

killerbee cd fxp0
killerbee ls -alR
total 6
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 ..
-rw-------   1 root  wheel    1 Sep 15 11:39 newqueue
-rw-------   1 root  wheel    1 Sep 15 11:39 newreserv
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 q0
-r--------   1 root  wheel    1 Sep 15 11:39 share

./q0:
total 4
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 ..
-rw-------   1 root  wheel    1 Sep 15 11:39 backlog
-rw-------   1 root  wheel    1 Sep 15 11:39 share
31Reservfs API
killerbee cd q1
killerbee ls -al
total 4
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 ..
-rw-------   1 root  wheel    1 Sep 15 11:39 share
-rw-------   1 root  wheel    1 Sep 15 11:39 backlog
killerbee cat share
50 1000000
killerbee

killerbee cd r0
killerbee ls -al
total 6
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 ..
-rw-------   1 root  wheel    1 Sep 15 11:39 newqueue
-rw-------   1 root  wheel    1 Sep 15 11:39 newreserv
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 q0
-r--------   1 root  wheel    1 Sep 15 11:39 share
killerbee echo 50 1000000 > newqueue
killerbee ls -al
total 6
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 ..
-rw-------   1 root  wheel    1 Sep 15 11:39 newqueue
-rw-------   1 root  wheel    1 Sep 15 11:39 newreserv
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 q0
dr-xr-xr-x   0 root  wheel  512 Sep 15 11:39 q1
-r--------   1 root  wheel    1 Sep 15 11:39 share
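Because reservations are plain files, a program can read and change QoS parameters with ordinary file I/O. The Python sketch below assumes the share file holds a weight and a min value separated by whitespace, as the output of `cat share` in the session suggests; the helper names are hypothetical, and the demo runs against a scratch directory because /reserv exists only on an Eclipse/BSD system.

```python
import os
import tempfile


def write_share(directory, weight, minimum):
    """Write "<weight> <min>" to the share file of a reservation directory."""
    with open(os.path.join(directory, "share"), "w") as f:
        f.write(f"{weight} {minimum}\n")


def read_share(directory):
    """Return (weight, min) parsed from the share file."""
    with open(os.path.join(directory, "share")) as f:
        weight, minimum = f.read().split()
    return int(weight), int(minimum)


# Scratch directory standing in for a queue directory such as /reserv/fxp0/q1.
with tempfile.TemporaryDirectory() as scratch:
    queue_dir = os.path.join(scratch, "q1")
    os.mkdir(queue_dir)
    write_share(queue_dir, 50, 1000000)
    demo = read_share(queue_dir)

print(demo)   # (50, 1000000)
```

On a real system the same two helpers would simply be pointed at a directory under /reserv, subject to the file permissions shown in the listing.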
32Reservfs
[Figure: web server and video server on top of the application interface, the reservation file system, and the scheduler interface]
33Reservfs Scheduler Interface
- Schedulers register by providing the following interface routines via reservfs_register()
- init(priv)
- create(priv, parent, type)
- start(priv, parent, type)
- delete(priv, node)
- get/set(priv, node, values, type)
34Reservfs Implementation
- Built via vnode/vfs interface
- A reserv structure represents each reservfs file
- reserv representing a directory contains a pointer to the corresponding node at scheduler
- Scheduler independent
- Implements garbage collection mechanism
35Talk Outline
- Introduction
- Schedulers
- Reservation File System (reservfs)
- Tagging
- Web Server Experiments
- Access Control and Profiles
- Eclipse/BSD Status
- Related Work
- Future Work
36Tagging
- A request arriving at a scheduler must be associated with the appropriate reservation
- Each request is tagged with a pointer to a queue node
- mbuf, buf and proc are augmented
- How is a request tagged?
37Tagging (contd.)
- For a file, its file descriptor is tagged with a disk reservation
- For a connected socket, its file descriptor is tagged with a network reservation
- For unconnected sockets, we provide a late tagging mechanism
- Each process is tagged with a cpu reservation
- We associate reservations with references to objects
38Default List of a Process
- Default reservations of a process, one for each resource
- A list of tags (pointers to queue directories)
- Used when a tag is otherwise not specified
- Two new files are added for each process pid in /proc/pid
- /proc/pid/default to represent the default list
- /proc/pid/cdefault to represent the child default list
39Default List of a Process (contd.)
- Reading these files returns the names of the default queue directories, e.g.,
- /reserv/cpu/q1
- /reserv/fxp0/r2/q1
- /reserv/da0/r1/q3
- A process, with the appropriate access rights, can change the entries of default files
40Implicit Tagging
- The file descriptor returned by open(), accept() or connect() is automatically tagged with the default
- The tag of the file descriptor of an unconnected socket is set to the default at sendto() and sendmsg()
- When a process forks, the child process is tagged with the default cpu reservation
41Explicit Tagging
- The tag of a file descriptor can be set/read with new commands to fcntl()
- F_SET_RES
- F_GET_RES
- A new system call chcpures() to change the cpu reservation of a process
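The interplay of implicit and explicit tagging can be modelled in a few lines of Python. Everything here is a stand-in: `set_res`, `get_res` and `chcpures` only imitate the effect of the fcntl() commands and the system call named on the slides, their real argument lists are not given in the talk, and the q2 paths are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    resource: str   # e.g. "da0" for a file, "fxp0" for a socket
    tag: str        # queue directory that requests on this descriptor are charged to


@dataclass
class Process:
    default: dict       # default list: resource -> queue directory
    cpu_tag: str = ""

    def __post_init__(self):
        # A process starts out tagged with its default cpu reservation.
        self.cpu_tag = self.cpu_tag or self.default["cpu"]

    def open_descriptor(self, resource):
        # Implicit tagging: a new descriptor inherits the default for its resource.
        return Descriptor(resource, self.default[resource])


def set_res(desc, queue_dir):
    """Model of explicit tagging, in the spirit of fcntl() with F_SET_RES."""
    desc.tag = queue_dir


def get_res(desc):
    """Model of reading a tag, in the spirit of fcntl() with F_GET_RES."""
    return desc.tag


def chcpures(proc, queue_dir):
    """Model of changing the cpu reservation of a process."""
    proc.cpu_tag = queue_dir


proc = Process({"cpu": "/reserv/cpu/q1",
                "fxp0": "/reserv/fxp0/r2/q1",
                "da0": "/reserv/da0/r1/q3"})
sock = proc.open_descriptor("fxp0")
implicit_tag = get_res(sock)
set_res(sock, "/reserv/fxp0/r2/q2")     # hypothetical second queue
chcpures(proc, "/reserv/cpu/q2")        # hypothetical second queue
```

An unmodified application relies on the implicit path alone, whereas a server that multiplexes several customers in one process would retag each descriptor explicitly.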
42Reservation Domains
- Permissions of a process to use, create and manipulate reservations
- The reservation domain of a process is independent of its protection domain
43Reservations and Reservation Domains
[Figure: reservation domain 1 and reservation domain 2]
44Reservfs Garbage Collection
- Based on reference counts
- every application that is using a specific node adds a reference on it (to the vnode)
- Triggered by the vnode layer
- when the last application finishes using the node, it is garbage collected
- fcntl() available to maintain the node even if no references to it exist
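The reference-counting rule can be sketched as follows. The class and attribute names are hypothetical, and the `pinned` flag merely stands in for the fcntl() request that keeps a node alive without references; the kernel performs this bookkeeping on vnodes rather than on Python objects.

```python
class ReservNode:
    """Toy reservation node whose lifetime follows its reference count."""

    def __init__(self, name):
        self.name = name
        self.refs = 0
        self.pinned = False   # models the fcntl() that maintains an unreferenced node
        self.alive = True

    def hold(self):
        # Called when an application starts using the node.
        self.refs += 1

    def release(self):
        # Called when an application stops using the node.
        self.refs -= 1
        if self.refs == 0 and not self.pinned:
            self.alive = False   # the node would be reclaimed at this point


q1 = ReservNode("q1")
q1.hold()
q1.hold()
q1.release()
still_used = q1.alive      # one reference is left
q1.release()
collected = not q1.alive   # last reference dropped

q2 = ReservNode("q2")
q2.pinned = True
q2.hold()
q2.release()
kept = q2.alive            # pinned nodes survive with zero references
```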
45SRP Input Processing
- Demultiplexes incoming packets
- before network and higher-level protocol processing
- Unprocessed input queue per socket
- Processes input protocols in context of receiving process
- Drops packets when per-socket queue is full
- Avoids receive livelock
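The effect of early demultiplexing with bounded per-socket queues can be seen in a small Python model. The class name, the queue limit and the socket identifiers are assumptions for illustration; the sketch covers only the queueing discipline, not the actual protocol processing.

```python
from collections import deque


class InputDemux:
    """Per-socket queues of unprocessed packets with early drop."""

    def __init__(self, queue_limit):
        self.queue_limit = queue_limit
        self.queues = {}    # socket id -> deque of unprocessed packets
        self.dropped = {}   # socket id -> number of packets dropped

    def packet_arrived(self, socket_id, packet):
        # Runs at arrival time: only classification, no protocol work.
        q = self.queues.setdefault(socket_id, deque())
        if len(q) >= self.queue_limit:
            self.dropped[socket_id] = self.dropped.get(socket_id, 0) + 1
            return False
        q.append(packet)
        return True

    def receive(self, socket_id, protocol_input):
        # Runs in the context of the receiving process, which pays for the work.
        q = self.queues.get(socket_id)
        if not q:
            return None
        return protocol_input(q.popleft())


demux = InputDemux(queue_limit=4)
for i in range(10):
    demux.packet_arrived("flood", i)          # an overloaded socket
demux.packet_arrived("web", "GET /")
web_result = demux.receive("web", lambda p: p.lower())
```

The flooded socket loses its excess packets before any protocol processing is spent on them, and the other socket still gets its packet delivered, which is the property that avoids receive livelock.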
47QoS Support for Web Server
- Virtual hosting with Apache server
- separate Apache server for each virtual host
- single Apache server for all virtual hosts
- Eclipse/BSD isolates and differentiates performance of virtual hosts
- multiple Apache servers: implicit tagging
- single Apache server: explicit tagging
- We implemented an Apache module for explicit tagging
48Experimental Setup
- Apache Web Server
- A multi-process server
- (Pre)spawns helper processes
- A process handles one request at a time
- Each process calls accept() to service the next connection request
- HTTP clients run on five different machines
- Servers are running FreeBSD 2.2.8 or Eclipse/BSD 2.2.8 on a PC (266 MHz Pentium Pro, 64 MB RAM, 9 GB Seagate ST39173W fast wide SCSI disk)
- Machines are connected with a 10/100 Mbps Ethernet switch
49Experiments
- Hosting two sites with two servers
[Figure: reservation domain of server 1 and reservation domain of server 2]
50CPU Intensive Workload
51CPU Intensive Workload
52Network Intensive Workload
53Disk Intensive Workload
54Input Intensive Workload
55Input Intensive Workload
56Experiments
- Hosting virtual hosts with a single Apache server
- Four web sites
57Apache Module for Tagging
- Apache code not modified; module added
- Apache config defines which reservation to use based on a rule, e.g.,
- directory-based
- port-based
- Module uses fcntl() and chcpures() for explicit tagging
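Rule-based selection of a reservation might look like the Python sketch below. The rule format (a list of kind, value, reservation triples tried in order), the function name and all paths are invented for illustration; the talk does not show the configuration syntax of the actual Apache module.

```python
def pick_reservation(rules, path, port, fallback):
    """Return the reservation of the first matching rule, else the fallback."""
    for kind, value, reservation in rules:
        if kind == "port" and port == value:
            return reservation
        if kind == "directory":
            prefix = value.rstrip("/") + "/"
            # Match the directory itself or anything below it.
            if path == value or path.startswith(prefix):
                return reservation
    return fallback


# Hypothetical rules: one site selected by directory, another by port.
rules = [
    ("directory", "/sites/a", "/reserv/fxp0/r1/q1"),
    ("port", 8080, "/reserv/fxp0/r2/q1"),
]
default_queue = "/reserv/fxp0/q0"

by_directory = pick_reservation(rules, "/sites/a/index.html", 80, default_queue)
by_port = pick_reservation(rules, "/sites/b/index.html", 8080, default_queue)
no_match = pick_reservation(rules, "/sites/abc", 80, default_queue)
```

After picking the reservation, a module of this kind would tag the connection's descriptor and the serving process explicitly, so a single server process can charge each request to the right virtual host.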
58Isolating Web Sites
Eclipse/BSD
59Isolating Web Sites
FreeBSD
60Talk Outline
- Introduction
- Reservation File System (reservfs)
- Tagging
- Schedulers
- Apache Web Server Experiments
- Access Control and Profiles
- Eclipse/BSD Status
- Related Work
- Future Work
61Access Control
- Permissions of a process to use or modify the objects belonging to the reservfs
- Currently, a process can use/modify reservations below its default list
- Soon, Eclipse/BSD will have more sophisticated access control
- process can have different permissions on a reservation (e.g., permission for tagging but not for modifying)
- process can have permission on arbitrary set of reservations
62Multiple Default Lists: Profiles
- Multiple default lists (profiles) simplify explicit tagging
- Server applications typically serve different entities (depending on client, content, etc.) with different QoS assignments
- Global list of system-wide profiles
- Profiles provide an easy way to manage and share default reservations of different entities
64Eclipse/BSD Status
- Derived from FreeBSD
- 3.2
- 2.2.8
- FreeBSD compatible
- Eclipse/BSD code is available at http://www.bell-labs.com/project/eclipse including
- reservfs
- hierarchical network scheduling
- hierarchical disk scheduling
- hierarchical cpu scheduling
- input scheduling
- also, Apache module for tagging and other
applications
65Related Work
- ALTQ
- good for routers
- not sufficient for QoS support in a general-purpose OS
- Resource Containers
- different from Reservation Domains
- limited (similar to our Profiles)
- not flexible enough to specify a number of useful provisioning needs
66Future work
- QoS on cluster of servers
- Support for fine-grained automatic tagging
- More server applications
- Supporting other QoS parameters
- Other schedulers