Title: Live Updating Operating Systems Using Virtualization
1Live Updating Operating Systems Using Virtualization
- Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang
- Fudan University
- Pen-Chung Yew
- University of Minnesota at Twin-Cities
2Motivation
- Operating Systems are far from perfect
- Security holes, design flaws, bugs, new features
- Result: continuous patches and upgrades are required
- Difficulties in applying patches and upgrades
- Disruptive: loss of availability
- Irreversible: risk of system crash
- A live update feature is highly desirable and, very often, critical.
3What Do Contemporary Operating Systems (COS) Miss?
- Requirements to live update an OS
- Define an updatable unit
- Difficult: a COS is monolithic
- Apply the patch at a safe point
- Some hot spots do not have a safe point
- e.g. root file system, network modules
- Consistency
- Difficult for an OS to update itself
4What is LUCOS?
- "Any problem in computer science can be solved with another level of indirection."
- David Wheeler, quoted in Butler Lampson's 1992 ACM Turing Award speech.
- Live Updating Contemporary Operating Systems using virtualization
- Use Virtual Machine Monitors (VMMs) to patch operating systems (e.g. Linux)
- Avoid the need for a safe point; allow co-existence of the old version and the new version of data structures.
- The VMM maintains the coherence and tracks when to finish a live update.
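The co-existence of old and new data structure versions can be illustrated with a small user-space sketch. It is not the LUCOS implementation: the `counter_v1`/`counter_v2` layouts, the `MS_PER_TICK` factor, and all function names are hypothetical, and the sketch assumes that the VMM can intercept a write to either copy and then run a state transfer function to bring the other copy up to date.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical old and new layouts of one kernel object (not from LUCOS). */
struct counter_v1 { int ticks; };
struct counter_v2 { long ticks; long ticks_ms; };

#define MS_PER_TICK 10 /* assumed conversion factor, for illustration only */

/* Both versions stay in memory while the update is in progress. */
struct shadowed_counter {
    struct counter_v1 old_copy; /* accessed by threads still in original code */
    struct counter_v2 new_copy; /* accessed by threads in patched code */
};

/* State transfer old -> new: rebuild the new layout from the old one. */
void transfer_old_to_new(struct shadowed_counter *c)
{
    c->new_copy.ticks = c->old_copy.ticks;
    c->new_copy.ticks_ms = c->new_copy.ticks * MS_PER_TICK;
}

/* State transfer new -> old: project the new layout back onto the old one. */
void transfer_new_to_old(struct shadowed_counter *c)
{
    c->old_copy.ticks = (int)c->new_copy.ticks;
}

/* Stand-in for the VMM handling an intercepted write to the old copy. */
void write_old_ticks(struct shadowed_counter *c, int value)
{
    assert(c != NULL);
    c->old_copy.ticks = value;
    transfer_old_to_new(c);
}

/* Stand-in for the VMM handling an intercepted write to the new copy. */
void write_new_ticks(struct shadowed_counter *c, long value)
{
    assert(c != NULL);
    c->new_copy.ticks = value;
    c->new_copy.ticks_ms = value * MS_PER_TICK;
    transfer_new_to_old(c);
}
```

In this sketch, a write through either path leaves both copies consistent, so original and patched code could run side by side until the update finishes.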
5What is LUCOS?
- A practical live updating system
- Apply a broad range of real-life Linux patches on the fly
- Require no safe points, retain OS transparency.
- Support patches for recovering tainted state (e.g. deadlock situation)
- Allow rolling back committed patches
- Require minimal update time (< 1 ms) and incur negligible performance overhead (less than 1%)
6Some Existing Efforts
- Dynamic Software Update
- Focus on live update to application software
- LUCOS: live update to operating systems
- K42 (Baumann et al., USENIX '05)
- A new operating system to support live update
- Tightly bound to object-oriented design techniques
- A safe point is desirable
- LUCOS: transparently supports existing OSes (including non-object-oriented ones), requires no safe point
7LUCOS Architecture
8Two Types of Live Updates
- Updates to only code
- Only code is modified.
- Updates to code with data changes
- Including global, single-instance data, or
multiple-instance data.
9Live Update to Code Only
10Live Update to Code with Data Changes
11Termination of a Live Update
- When all threads leave original functions
- Stack inspection (Altekar, USENIX Security '05)
- Maintain a list of threads executing in original functions
- Remove threads that leave original functions
- Terminate live update when the list is empty
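The termination rule above can be sketched as a small bookkeeping structure. This is a hedged illustration, not LUCOS code: `update_tracker`, `tracker_add`, `tracker_thread_left`, and the fixed `MAX_TRACKED` capacity are hypothetical, and thread IDs are plain integers. It assumes stack inspection has already identified which threads are inside original functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 64 /* assumed upper bound, for illustration only */

/* Threads known to be executing inside original (unpatched) functions. */
struct update_tracker {
    int tids[MAX_TRACKED];
    size_t count;
};

/* Record a thread that stack inspection found inside an original function.
 * Returns false if the tracker is full. */
bool tracker_add(struct update_tracker *t, int tid)
{
    assert(t != NULL);
    if (t->count == MAX_TRACKED)
        return false;
    t->tids[t->count++] = tid;
    return true;
}

/* Called when a thread leaves the original functions. Removes it from the
 * list if present and returns true once the list is empty, i.e. when the
 * live update can be terminated. */
bool tracker_thread_left(struct update_tracker *t, int tid)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->tids[i] == tid) {
            /* Swap-remove: overwrite the slot with the last entry. */
            t->tids[i] = t->tids[--t->count];
            break;
        }
    }
    return t->count == 0;
}
```

For example, after adding threads 1 and 2, `tracker_thread_left(&t, 1)` returns false and `tracker_thread_left(&t, 2)` returns true, which is the point at which the update would finish.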
12Patches for Recovering Tainted State
- Vision
- Some bugs could cause a tainted state
- Deadlock situation
- Simple patching could not solve the problem
spinlock_t demo_lock = SPIN_LOCK_UNLOCKED;
void foo(void)
{
    ...
    spin_lock(&demo_lock);
    ...
    if (condition)
        return;    /* BUG: returns with demo_lock still held */
    ...
    spin_unlock(&demo_lock);
}
Code 1. A buggy function with a potential for deadlocks.

spinlock_t demo_lock = SPIN_LOCK_UNLOCKED;
void foo_patch(void)
{
    ...
    spin_lock(&demo_lock);
    ...
    if (condition) {
        spin_unlock(&demo_lock);    /* fix: release the lock before returning */
        return;
    }
    ...
    spin_unlock(&demo_lock);
}
Code 2. A patch function to fix the deadlock problem.

void state_transfer(void)
{
    if (spin_is_locked(&demo_lock))
        spin_unlock(&demo_lock);
}
Code 3. A callback function to recover from a deadlocked situation.
13Patches for Recovering Tainted State
- Solutions
- Allow callbacks in live update
- Three types of callbacks in LUCOS
- function callbacks
- thread callbacks
- data callbacks
- Example: use thread callbacks to resolve the deadlock situation
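A single-threaded user-space sketch shows why the callback is needed in addition to the code patch. It is an illustration under stated assumptions, not LUCOS or kernel code: `toy_lock` and its helpers are hypothetical stand-ins for the kernel spinlock API, `toy_try_lock` returns false where a real spinlock would spin forever, and `apply_patch` and `current_foo` are invented names for the redirection step.

```c
#include <stdbool.h>

/* Toy lock: a hypothetical stand-in for a kernel spinlock. */
struct toy_lock { bool locked; };

bool toy_try_lock(struct toy_lock *l)
{
    if (l->locked)
        return false; /* a real spinlock would spin here: deadlock */
    l->locked = true;
    return true;
}

void toy_unlock(struct toy_lock *l) { l->locked = false; }
bool toy_is_locked(const struct toy_lock *l) { return l->locked; }

static struct toy_lock demo_lock;

/* Analogue of Code 1: the early return leaves the lock held. */
bool foo(bool condition)
{
    if (!toy_try_lock(&demo_lock))
        return false;
    if (condition)
        return true; /* BUG: demo_lock is never released */
    toy_unlock(&demo_lock);
    return true;
}

/* Analogue of Code 2: every return path releases the lock. */
bool foo_patch(bool condition)
{
    if (!toy_try_lock(&demo_lock))
        return false;
    if (condition) {
        toy_unlock(&demo_lock);
        return true;
    }
    toy_unlock(&demo_lock);
    return true;
}

/* Analogue of Code 3: repair the tainted state left by the bug. */
void state_transfer(void)
{
    if (toy_is_locked(&demo_lock))
        toy_unlock(&demo_lock);
}

typedef bool (*foo_fn)(bool);
static foo_fn current_foo = foo;

/* Hypothetical patch application: run the callback, then redirect calls. */
void apply_patch(void)
{
    state_transfer();
    current_foo = foo_patch;
}
```

If `foo(true)` has already run, redirecting calls to `foo_patch` alone still fails on the next call because the lock remains held; only after `state_transfer` clears the tainted state does the patched function succeed.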
14Patch Rollback
- A special type of patch
- Use the original code and data to patch the committed ones
- Change state with new data back to original data
- Resource overhead
- Has to keep original code and data in memory
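Rollback as "a patch in the reverse direction" can be sketched as follows. This is a hedged illustration rather than the LUCOS mechanism: `patch_record`, `handler_v1`, `handler_v2`, and the factor of 1000 used in the state transfer are hypothetical. The point it shows is that the original code pointer and original data must stay in memory for rollback to be possible.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

int handler_v1(int x) { return x + 1; } /* hypothetical original code */
int handler_v2(int x) { return x + 2; } /* hypothetical patched code */

struct patch_record {
    handler_fn original; /* retained in memory so rollback stays possible */
    handler_fn patched;
    handler_fn active;   /* the version callers are currently routed to */
    int old_state;       /* original data, e.g. a value in seconds */
    long new_state;      /* new data, e.g. the same value in milliseconds */
    int committed;
};

/* Commit: transfer state old -> new and route callers to the patched code. */
void patch_commit(struct patch_record *p)
{
    assert(p != NULL);
    p->new_state = (long)p->old_state * 1000;
    p->active = p->patched;
    p->committed = 1;
}

/* Rollback: transfer state new -> old and route callers to the original. */
void patch_rollback(struct patch_record *p)
{
    assert(p != NULL);
    if (!p->committed)
        return;
    p->old_state = (int)(p->new_state / 1000);
    p->active = p->original;
    p->committed = 0;
}
```

Changes made to the new state while the patch was committed are carried back into the original data on rollback, which is the "change state with new data back to original data" step.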
15Experimental Setup
- Implemented for Linux 2.6.10 running on Xen-2.0.5.
- Systems
- Fedora Core 2 distribution
- 3.0 GHz Pentium IV with 1 GB RAM
- Intel Pro 100/1000 Ethernet NIC on a 100 Mb/s LAN
- A single 250 GB 7200 RPM SATA disk.
16Workloads
- SPEC INT 2000
- Measure the performance of CPU-intensive workloads
- Linux build time
- Measure the overall time to build a Linux kernel 2.6.10 with gcc-3.3.3.
- Open Source Database Benchmark suite (OSDB)
- Information Retrieval (IR)
- Online Transaction Processing (OLTP)
17Experience with Real-Life Patches
- Five typical patches selected from Linux upgrades
- Upgrade of the Linux kernel from 2.6.10 to 2.6.11
- Upgrade of backend block device drivers in Xen-Linux
No.  Patch type  Description
1    Type 1      Fixing the page reading bug
2    Type 1      Removal of livelock avoidance
3    Type 2      Upgrading the process scheduler
4    Type 2      Reconstruction of the IRQ descriptors
5    Type 2      Upgrading backend block device drivers in Xen-Linux
18Time to Apply and Roll Back Live Updates
Note: OSDB-IR/OLTP are running in the background when the patches are applied and rolled back.
19Relative Performance (Normal Execution)
20Conclusions
- Existing operating systems can be live updated
- No safe point is required
- Patches should recover tainted state
- Rollback of a live update is supported
- Time overhead to apply a live update is minimal
- Performance overhead is negligible
21Future Work
- Avoid the performance overhead of virtualization
- Integrate it with our self-virtualization system
- Virtualize operating systems on demand
22Questions?
- Our contact information
- Parallel Processing Institute, Fudan University, China
- Phone: 86-21-51355363
- Fax: 86-21-65646571
24Patch File Format in LUCOS
- Follows the format of Linux kernel modules, and adds
- New declarations of data structures
- Callback functions
- Patch startup and patch cleanup functions
- State transfer
25Fine-grained memory protection
- Facilitating ECC memory (Qin et al., HPCA '05)
- Cache line granularity
- Mondrian memory protection (Witchel et al., ASPLOS-X)
- Word-level memory protection
26Self-virtualization architecture
- The OS can quickly switch between the three modes on the fly
- Applications are completely unaware of the mode switch
- Hosting mode is used to host other OSes.
- Migrating mode prepares the OS to self-migrate to another machine.